The data volume challenge

Present-day sky survey databases, such as DPOSS, SuperCOSMOS, or SDSS, are all roughly 10 TB in size. From 2004 onwards, we expect WFCAM to produce 20 TB/year of archived raw data. The leap upwards is due to smaller pixel size, short exposures, and repeated coverage. From 2006, we expect VISTA to produce perhaps 100 TB/year. A few years after that, if and when the LSST becomes a reality, it may produce 1 PB/year of data that we wish to save. WFCAM, then, is an important intermediate step and test case. Are there significant problems associated with this increased data volume?

We need to carefully distinguish different kinds of data. A single 2048

image with 4 bytes per pixel takes up 16MB. We will be producing four of these simultaneously every 10 seconds, and may also be employing 1 second non-destructive readouts in order to have useful information on 2MASS stars in the field which might saturate in 10 seconds. The DAS therefore needs to cope with a maximum rate of 12MB/sec. But not all of this is output by the DAS or saved. The peak rate for saved data is given by assuming a collection of four 4K frames each of which is an interleaved set of 2 $\times$ 2 micro-stepped 2K frames, with individual 10 second exposures and an observing efficiency of 0.65. If such a shallow survey mode, with no co-adds, is carried on continously for a 14 hour winter night, we get a peak recorded data rate of 230 GB/night. A considerable part of the science programme will involve such shallow surveying, but on the other hand much will involve co-adding at a single sky position in any one night. If we take the planned UKIDSS programme as representative of all WFCAM use, and take an average night as 10 useable hours, then we get an average recorded data rate of 100 GB/night. Given the expected useage of WFCAM, the raw frame data then grows at $\sim$ 20TB/year. Following processing, there will be calibrated versions of each frame, plus housekeeping information, confidence arrays, and object catalogues, which may in total add up to 30 TB/year for the calibrated frame data. After mosaicing and stacking, the final derived UKIDSS survey maps are $\sim$ 50 TB in total. The density of objects will vary greatly from one location to another, but the final UKIDSS survey object catalogues are expected to be about 5 TB in full, although much smaller stripped down catalogues with only a few key parameters will also be available. Table 1 summarises these data rates and volumes.
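
As a rough check on the frame size and the peak recorded rate quoted above, the arithmetic can be sketched as follows. This is a back-of-envelope illustration, not the project's own calculation. It assumes that an interleaved 4K frame holds exactly the pixels of its four 2K exposures, that no header or housekeeping overhead is stored, and that the frame size uses 1 MB = $2^{20}$ bytes; the function and constant names are hypothetical. Under these assumptions the result is roughly 220 GB/night, close to the 230 GB/night quoted, with the remainder presumably due to overheads not modelled here.

```python
# Back-of-envelope estimate of WFCAM frame size and peak recorded data volume.
# All constants come from the text; names are illustrative only.

DETECTOR_SIDE = 2048      # pixels per side of one 2K array
BYTES_PER_PIXEL = 4
N_DETECTORS = 4           # four arrays read out simultaneously
EXPOSURE_S = 10.0         # individual exposure time in seconds
EFFICIENCY = 0.65         # observing efficiency
NIGHT_HOURS = 14.0        # long winter night


def frame_bytes(side=DETECTOR_SIDE, bytes_per_pixel=BYTES_PER_PIXEL):
    """Size in bytes of one 2K x 2K frame."""
    return side * side * bytes_per_pixel


def recorded_bytes_per_night(hours=NIGHT_HOURS,
                             efficiency=EFFICIENCY,
                             exposure_s=EXPOSURE_S,
                             n_detectors=N_DETECTORS):
    """Bytes saved in one night of shallow surveying with no co-adds.

    Assumption: every exposure contributes one 2K frame per detector,
    so 2 x 2 micro-stepping changes the layout but not the pixel count.
    """
    open_shutter_s = hours * 3600.0 * efficiency
    n_exposures = open_shutter_s / exposure_s
    return n_exposures * n_detectors * frame_bytes()


if __name__ == "__main__":
    print(f"one frame: {frame_bytes() / 2**20:.0f} MB")
    print(f"peak night: {recorded_bytes_per_night() / 1e9:.0f} GB")
```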

**Table 1:** Estimated WFCAM data rates/sizes

| Quantity | Estimate |
|---|---|
| Peak data flow at instrument | 12 MB/s |
| Peak recorded data rate | 230 GB/night |
| Average recorded data rate | 100 GB/night |
| Raw frame data accumulation | 20 TB/yr |
| Calibrated frame data and ancillary data accumulation | 30 TB/yr |
| 2010 archive: frame data | 350 TB |
| 2010 archive: stacked survey maps | 50 TB |
| 2010 archive: survey object catalogues | 5 TB |
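
The entries in Table 1 can be cross-checked against one another with a few lines of arithmetic. The sketch below is illustrative only and rests on two assumptions that are not stated explicitly in the text: that the 2010 frame archive is the sum of raw and calibrated accumulation over the seven years 2004 to 2010 inclusive, and that the annual raw accumulation is simply the average nightly rate multiplied by the number of observing nights. The function names are hypothetical.

```python
# Consistency check of the Table 1 figures (values taken from the table).

RAW_TB_PER_YEAR = 20
CALIBRATED_TB_PER_YEAR = 30
AVERAGE_GB_PER_NIGHT = 100
FIRST_YEAR, LAST_YEAR = 2004, 2010   # assumption: inclusive range


def frame_archive_tb(first=FIRST_YEAR, last=LAST_YEAR):
    """Total raw plus calibrated frame data accumulated by the end of `last`."""
    n_years = last - first + 1
    return n_years * (RAW_TB_PER_YEAR + CALIBRATED_TB_PER_YEAR)


def implied_nights_per_year(raw_tb_per_year=RAW_TB_PER_YEAR,
                            gb_per_night=AVERAGE_GB_PER_NIGHT):
    """Observing nights per year implied by the raw accumulation rate."""
    return raw_tb_per_year * 1000 / gb_per_night


if __name__ == "__main__":
    print(f"frame archive by {LAST_YEAR}: {frame_archive_tb()} TB")
    print(f"implied nights per year: {implied_nights_per_year():.0f}")
```

Under these assumptions the frame archive comes to 350 TB, matching the table, and the raw accumulation corresponds to about 200 nights of WFCAM use per year.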

Are these rates and volumes scary or not? In terms of pipeline processing hardware, the rates are fast but not very intimidating: a small cluster will be needed to keep up. Likewise, archive storage presents no serious technological problems, only moderate expense. Much larger costs and logistical problems are involved in the human operation and curation of the pipeline and archive, along with quality control and science calibration. Between now and 2010, this is where most of the money goes.

What about the software development for the pipeline? If we were building from scratch, or building a completely flexible general-purpose pipeline, this could easily eat up many millions of pounds. This is largely why our first priority is a well-defined standard pipeline, although we are building it in a modular fashion to allow a cookbook approach to alternative reductions. Likewise, the cost of archive software development depends strongly on ambition, so we are approaching this in layers (see below).

It is also important that we are building on the existing experience and algorithms of CASU and WFAU. For example, the processing of Schmidt plates with SuperCOSMOS by WFAU may seem old-fashioned, but it actually involves a data rate similar to that of SDSS (20 GB/day); quality control and feedback to the observing team; removal of instrumental signature; source extraction and photometric and astrometric calibration; proper publication of experimental procedures as well as assessment of survey products[7,,]; and ingestion into an on-line science archive (http://www-wfau.roe.ac.uk/sss). Likewise CASU, as well as having similar experience with APM, has been responsible for processing and archiving data for the public CCD mosaic INT Wide Field Survey (INT-WFS)[8,9,10] (see http://www.ast.cam.ac.uk/ wfcsur/index.php), and already has a working pipeline for a smaller four-array IR camera, CIRSI.