

User expectations in the VO era

Storing hundreds of TB of data on spinning disk is not expected to be a major technological challenge, though it will not be a trivial expense. If Moore's law continues, storage will become easier as time goes by. The expensive part of an archive system is the human effort in operation, curation, calibration and documentation. But what about software development of the on-line archive service? This rather depends on the level of service offered, and the expectations of users are increasing considerably as we enter the Virtual Observatory (VO) era. One can distinguish three levels of on-line archive service: data access, complex queries, and database manipulations. We expect to provide at least the first two for WFCAM, and in collaboration with AstroGrid, the UK VO project, we hope to provide service at the third level.

Straightforward data access is the standard today. One can offer either distinct data subsets such as plates or frames, chosen from a browsable list, or pixel maps and catalogues over user-defined areas, created from some survey database. Typically this is done through a web-browser and CGI script interface, with the user seeing a JPEG image and then being offered download of a FITS file. For such data subset access, the volume being served makes little difference, as long as the data is sensibly indexed. Access to small, arbitrary subsets normally takes only a few seconds; download of larger datasets is limited by network bandwidth, not by the service at the data centre. The most common technical solution is storage on a RAID disk array attached to a web server. The data can be stored in and interrogated through a Data Base Management System (DBMS), but in practice flat files and home-grown software are often used and work very fast. (Such systems, however, do not transfer well to the next database.)
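As a concrete illustration (not part of the WFCAM design), the following Python sketch shows the flavour of such a flat-file access service: a CGI script that locates a named frame on the RAID store and streams the FITS file back to the browser. The directory layout, the ``frame'' parameter and the file naming convention are all hypothetical.

#!/usr/bin/env python3
# Hypothetical Level-1 data-access CGI script: stream a named FITS frame
# from a flat-file store on the RAID array.  Paths, the "frame" parameter
# and the file naming are illustrative only.
import cgi
import os
import re
import sys

FRAME_DIR = "/archive/frames"        # flat files on the RAID array

def main():
    form = cgi.FieldStorage()
    frame = form.getfirst("frame", "")
    # Accept only plain frame names, so a request cannot escape FRAME_DIR.
    if not re.fullmatch(r"[A-Za-z0-9_]+", frame):
        sys.stdout.write("Status: 400 Bad Request\r\n\r\n")
        return
    path = os.path.join(FRAME_DIR, frame + ".fits")
    if not os.path.isfile(path):
        sys.stdout.write("Status: 404 Not Found\r\n\r\n")
        return
    sys.stdout.write("Content-Type: image/fits\r\n")
    sys.stdout.write("Content-Length: %d\r\n\r\n" % os.path.getsize(path))
    sys.stdout.flush()
    with open(path, "rb") as f:
        sys.stdout.buffer.write(f.read())   # send the pixels verbatim

if __name__ == "__main__":
    main()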

With the second level of service, complex queries, one can construct questions in SQL or a similar language, along the lines of ``give me a list of objects redder than X in this area of sky, with measurable proper motion, that have such-and-such quality flag better than Y (unless R-mag is brighter than Z, in which case accept anything), and that were found on a Tuesday''. This kind of service is slowly becoming more common. An example is the SDSS science archive, used either with the user-installed sdssQT query tool or through a web browser interface (see http://archive.stsci.edu/sdss). Another example is the 2dF Galaxy Redshift Survey web page run from Mount Stromlo (http://www.mso.anu.edu.au/2dfGRS). This requires a proper DBMS of some kind (Objectivity for SDSS, soon changing to MS SQL Server, and miniSQL for the 2dFGRS) and some sort of user tool interfacing between the astronomer and the DBMS. Answering arbitrary complex queries on very large multi-parameter datasets efficiently needs intelligent indexing and caching. Even then, there will always be occasions when very large numbers of table rows need to be searched through sequentially and only brute force will do. Most such searches are limited by CPU and disk I/O, and sometimes by seek time. Even at 100 MB/s, a 10 TB database takes over a day to search. The increasingly accepted solution is to have a cluster with many CPUs searching in parallel. Such a search engine can provide the catalogue storage and the CPU power at the same time.
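To make the example query above concrete, the Python sketch below runs a comparable SQL query against a toy in-memory table. The schema, column names and rows are purely illustrative, and SQLite stands in here for whichever DBMS the archive actually uses.

# Sketch of a Level-2 complex query.  SQLite stands in for the archive
# DBMS; the Source table, its columns and the toy rows are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Source (
        objID INTEGER, ra REAL, dec REAL,
        jmag REAL, kmag REAL, rmag REAL,
        pmTot REAL, pmTotErr REAL,
        qualityFlag INTEGER, obsDate TEXT
    );
    INSERT INTO Source VALUES
        (1, 152.1,  0.3, 17.2, 14.1, 18.9, 0.12, 0.02, 200, '2006-01-03'),
        (2, 155.7, -2.0, 16.0, 15.5, 13.5, 0.30, 0.05,  50, '2006-01-04');
""")

QUERY = """
    SELECT objID, ra, dec, jmag - kmag AS colour
    FROM   Source
    WHERE  jmag - kmag > 2.5                    -- redder than X
      AND  ra BETWEEN 150 AND 160               -- this area of sky
      AND  dec BETWEEN -5 AND 5
      AND  pmTot > 3.0 * pmTotErr               -- measurable proper motion
      AND  (qualityFlag > 128 OR rmag < 14.0)   -- quality better than Y, unless bright
      AND  strftime('%w', obsDate) = '2'        -- found on a Tuesday (SQLite syntax)
    """
for row in conn.execute(QUERY):
    print(row)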

The leading edge now, which can be seen as Level-2a, is in federated queries, i.e. the ability to make joint queries of arbitrary databases distributed around the world, e.g. ``give me all the objects in the UKIDSS LAS survey which were not seen in the SDSS but do have an X-ray ID in either a Chandra or an XMM observation, and check the list against the ESO VLT observing log''. This is the key problem being tackled by the various VO projects worldwide. It involves standardisation of data and metadata, but also the use of new standardised ``web service'' data exchange methods with XML-formatted data, SOAP message wrappers, and service description with WSDL. It also needs some kind of astronomical registry service, and a standardised method of ``single sign-on'' using some kind of digital certificate rather than a multiplicity of passwords. The prospects of the necessary technological solutions being in place by the end of WFCAM's first year of operation are good, so we anticipate participating in this kind of federated query, and indeed will work closely with the AstroGrid project in particular to make the WFCAM archive and associated services VO-ready.
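As a rough illustration of the ``web service'' mechanics, the sketch below posts a SOAP-wrapped, XML-formatted query to a remote archive endpoint using only the Python standard library. The endpoint URL and the element names in the payload are hypothetical; in a real VO setting the interface would be described by the service's WSDL document and discovered through a registry.

# Hypothetical Level-2a web-service call: SOAP envelope, XML payload.
# The endpoint and the <coneSearch> element names are illustrative only.
import urllib.request
import xml.etree.ElementTree as ET

ENDPOINT = "http://example.org/ukidss-las/query"     # hypothetical service

envelope = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <coneSearch>
      <ra>180.0</ra><dec>0.0</dec><radius>0.1</radius>
    </coneSearch>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    ENDPOINT,
    data=envelope.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8"},
)
with urllib.request.urlopen(request) as response:
    reply = ET.fromstring(response.read())           # XML-formatted result
    print(ET.tostring(reply, encoding="unicode"))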

The third level of service involves large database manipulations. What we have in mind are things such as the calculation of correlation functions, cluster analysis in N-D parameter space, making statistical digests so that one can find objects 5$\sigma$ outside the main clump, visualisation and exploration of multi-faceted datasets, and so on. Today, such data-intensive calculations are the province of specialised ``power users'' on their own machines, but we expect that they will increasingly be provided as a standard, fast service at the data centre, and that it will become common to do exploratory analysis this way as well as rigorous calculations. Furthermore, because of the increasingly large archive volume and network limitations, it will be more practical to use a service provided by a data centre than to download huge amounts of data and hack one's own code. Such data-intensive calculations usually need $N^2$ or $N\log N$ algorithms, so PC farms, with slow interconnects between nodes, are too slow, and one needs a proper multi-processor SMP machine. In other words, to offer this kind of service one needs, as well as a PC-farm search engine, an expensive analysis engine and facility-class data analysis software to go with it. This is a major challenge, but one we hope to work towards in combination with AstroGrid and other VO projects.
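To illustrate why such calculations scale so badly, the Python sketch below does brute-force pair counting for a two-point correlation function estimate on a toy random catalogue; the $N^2$ pair count is exactly the kind of operation that quickly outgrows a single workstation. The positions, sample size and separation bin are arbitrary, and a production service would of course use tree or grid methods and shared-memory parallelism.

# Toy N^2 calculation: count pairs of objects in one separation bin.
# Positions are random; even 10^6 real objects would mean ~10^12 pairs.
import numpy as np

rng = np.random.default_rng(42)
n = 2000
ra = rng.uniform(0.0, 10.0, n)              # degrees
dec = rng.uniform(-5.0, 5.0, n)             # degrees

# Small-angle separation of every pair (flat-sky approximation).
dra = (ra[:, None] - ra[None, :]) * np.cos(np.radians(dec))[None, :]
ddec = dec[:, None] - dec[None, :]
sep = np.sqrt(dra**2 + ddec**2)

# Count each pair once: upper triangle, separations between 0.1 and 0.2 deg.
upper = np.triu_indices(n, k=1)
bin_pairs = np.sum((sep[upper] > 0.1) & (sep[upper] < 0.2))
print("pairs in bin:", bin_pairs, "of", n * (n - 1) // 2)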

