Research : The Virtual Observatory

Introduction
A VO Primer
Get the software
VO links
E-science in other disciplines

Introduction.
Since 2001, I have been involved in the development of e-science, and specifically the creation of the Virtual Observatory (VO). What drove me into this was the realisation that the data volumes for surveys I was planning would be huge. There would be a bandwidth bottleneck making it hard for astronomers to capitalise on these surveys, and requiring them to be analysed at source rather than downloaded. At the same time, more and more astronomers wanted to compare multiple databases hosted on different continents. What we needed was a transparent Internet framework for exploring data : a Virtual Observatory. This however posed considerable technological and sociological challenges. A number of other people round the world were having similar thoughts, with Alex Szalay leading the way, and several VO projects started up, soon forming an International Virtual Observatory Alliance.

I have led two projects : the UK project AstroGrid, and the European project VOTECH. (See the Project Links section.) Both are now completed, but the software is still available. I remain a partner in a succession of Europe-wide VO projects.

A VO primer.
When you surf the web, and click from link to link all over the world, it just feels as if all those documents are right there in your PC. The aim of the VO is to get the same transparent feel for astronomical data. All the databases all over the world, and the tools needed to analyse the data, are right there. You just work.

So how do we bring about this vision ? Well, it's not a question of warehousing all the world's data in one giant data centre. That would never happen, politically, pragmatically, or financially. Neither is it a question of writing a monolithic piece of software, a giant "VO application". That would be incredibly hard to do, and it would be out of date and look clunky very quickly. The Virtual Observatory is a way of life. What we want first and foremost is to agree standards - then people can write VO aware tools, and make data available as web services, in a modular way so that things connect. The key aim is automation; things we currently do by hand should be understood by applications.

So creating the VO needs four things : (i) standards, (ii) content, (iii) tools, and (iv) glue software.

Setting standards is not just about data formats, but about the meanings of columns in tables (what does "VMAG" mean ?), about how you specify what a service offers, about how you define sky coverage, and about how you define software interfaces, so that software modules can plug into each other. This is the bedrock of the VO.

Putting content into the VO is about existing data centres choosing to make their datasets available as standardised services - following W3C standards such as WSDL and SOAP, but also IVOA standards for describing services and for exporting data in recognised formats. Such a "service" could be just a list of images you can download, or an interface that accepts an SQL query to a database, or something much more complicated and CPU intensive, like running SExtractor on an image, or calculating a correlation function.

For the VO to really take off, we need the tools we write to be "VO aware" - for example a table analyser and plotter will know what it can do with a column labelled "VMAG", and an image viewer can browse collections of images from around the world. Other new types of tool can be things like a browser for discovering datasets, or a "workflow builder" for piecing together steps in a programmable way.

Finally, for this all to work, you also need some glue software, implemented as "core services". For example, the availability of VO compatible data services and tools needs to be published in a registry which can be queried or browsed; we need to store and manipulate information on who's who and who's allowed to get at what, so we can have a "single sign on" system; workflow engines need to track and run long lived multi-step jobs; virtual storage systems are needed both to cache steps in such workflows, and to provide shareable storage for users. The need for these "core services" doesn't mean we have some kind of "central VO control" - they can be offered separately and competitively, just like the datasets and tools. "Use my registry, it's got twice as much detail".
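
To make the registry idea a little more concrete, here is a minimal sketch in Python of how published service descriptions and agreed column meanings might let a tool find data automatically. Everything in it is an assumption made for illustration: the `ServiceRecord` fields, the `Registry` class, the service names, the `example.org` addresses and the plain-English meaning labels are all hypothetical, and none of it is the actual IVOA registry interface or the AstroGrid implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """One registry entry describing what a data service offers (hypothetical schema)."""
    name: str
    kind: str   # "image-list", "sql-query" or "compute" (labels invented for this sketch)
    url: str    # placeholder endpoints, not real services
    column_meanings: dict = field(default_factory=dict)  # column label -> agreed meaning


class Registry:
    """A toy in-memory registry that can be published to and queried."""

    def __init__(self):
        self._records = []

    def publish(self, record):
        """Add a service description to the registry."""
        self._records.append(record)

    def find(self, kind=None, meaning=None):
        """Return records matching the requested kind and/or column meaning."""
        hits = []
        for record in self._records:
            if kind is not None and record.kind != kind:
                continue
            if meaning is not None and meaning not in record.column_meanings.values():
                continue
            hits.append(record)
        return hits


registry = Registry()
registry.publish(ServiceRecord(
    name="survey-a-catalogue",
    kind="sql-query",
    url="https://example.org/survey-a/query",
    column_meanings={"VMAG": "V-band magnitude", "RA": "right ascension"},
))
registry.publish(ServiceRecord(
    name="survey-b-catalogue",
    kind="sql-query",
    url="https://example.org/survey-b/query",
    column_meanings={"mag_v": "V-band magnitude"},
))
registry.publish(ServiceRecord(
    name="survey-b-images",
    kind="image-list",
    url="https://example.org/survey-b/images",
))

# A tool asks for every queryable catalogue holding V-band magnitudes,
# whatever each data centre happened to call the column.
for record in registry.find(kind="sql-query", meaning="V-band magnitude"):
    print(record.name, record.url)
```

The point of the sketch is that the tool never needs to know that one catalogue says "VMAG" and the other says "mag_v"; it only needs the agreed meaning. Because the registry is just another service, nothing stops several of them existing side by side.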

Get the software.
There is a nice collection of tools linked at the Euro-VO website software page. The two favourite VO tools are Topcat (tables) and Aladin (images). The AstroGrid software is mostly technical infrastructure, and has been absorbed into other projects (for example, it runs the AstroGrid registry which Topcat uses). Other than Topcat, the tool most users still use is VO Desktop, which you can still find at the old AstroGrid web pages.

VO links.
Here are some more VO links.

The International Virtual Observatory Alliance (IVOA). This body debates and agrees the technical standards and protocols that make the VO possible, and also acts as a technology exchange forum, meeting twice a year.

The European Virtual Observatory (Euro-VO). This is a co-ordinating body for European VO work, and has three arms - the VO Facility Centre, led jointly by ESO and ESA, the Data Centre Alliance, led by Strasbourg, and the VO Technology Centre, led by AstroGrid.

US-VO. The US Virtual Observatory Project.

Here is a talk by me reviewing the origins and the current state of the VO. It was given at the IAU in Prague in summer 2006.

For historical interest, here is the initial proposal of AstroGrid to PPARC in October 2000, and here is the website of the very first VO conference, at Caltech in June 2000.

E-science in other disciplines.
Of course, these developments are not unique to Astronomy - similar themes are found across all of science, as well as in business, and are major factors in the development of the Web. Transparency of data and exchange of data in business was what drove the development of XML and web services (SOAP and WSDL etc). Machine processing of meaning is what the Semantic Web is all about ("What sort of things are in columns A and B ? Does it make any sense to add them ?"). Finally, transparency of CPU is what the Grid is all about - getting lots of computers to act as one computer. Personally, I have never liked the term "Grid" as it sounds too rigid, but it is supposed to be a metaphor for the electric power grid - power on demand, just plug into the wall.
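
The "does it make any sense to add them ?" question can be illustrated with a small Python sketch. It assumes each column carries a machine-readable description of what it measures and in what unit; the `ColumnMeta` class, the `can_add` rule and the labels used are hypothetical, chosen only for illustration, and a real semantic description would be much richer than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMeta:
    """Machine-readable description of a table column (hypothetical fields)."""
    name: str
    quantity: str  # what the numbers measure
    unit: str      # the unit they are expressed in


def can_add(a, b):
    """Deliberately simple rule: only add columns of the same quantity and unit."""
    return a.quantity == b.quantity and a.unit == b.unit


column_a = ColumnMeta("A", "flux density", "Jy")
column_b = ColumnMeta("B", "flux density", "Jy")
column_ra = ColumnMeta("RA", "right ascension", "deg")

print(can_add(column_a, column_b))   # two flux densities in the same unit
print(can_add(column_a, column_ra))  # a flux density and a sky position
```

Once the meaning travels with the data, an application can refuse the second sum by itself instead of relying on a person to spot the mistake.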

In the UK, we had an explicitly funded "e-Science" programme for many years from 2001. Edinburgh and Glasgow jointly hosted the National e-Science Centre. Sadly this is no longer running.