|Draft||Oct 2002||Original version (NCH, AL & IAB)|
|V1.0||Jan 2003||Revised (NCH)|
The purpose of this data flow document (DFD) is to examine data transfer rates, accumulating data volumes, data formats and data curation issues for the WFCAM Science Archive (WSA) project. This, in turn, is intended to inform the hardware and software design of the WSA. Issues to be considered include import/export rates/volumes/formats; post-processing; DBMS ingest and processing (eg. indexing); and backup/update/releases. The major goal of this document is to inform decisions concerning storage hardware and investigation of network connectivity.
In this context, we consider the `flow' in data flow as encompassing i) import, ii) curation, and iii) use (including export); where `data' encompasses pixels/catalogues/housekeeping information for both WFCAM and complementary datasets (see the SRD). The DFD is organised into Sections following these three broad categories.
An overview of the end-to-end data flow, including data rates and accumulation volumes, is given in Appendix A.
This section examines the end-to-end data flow for the UKIRT WFCAM project in order to estimate
The following assumptions can be made about WFCAM; these are unlikely to change:
|Survey||Nights||Gbyte per night||Total (Tbyte)|
|LAS||183.4||160||29.3|
|GPS||130.2||92||12.0|
|GCS||58.8||160||9.4|
|DXS||123.9||52||6.4|
|UDS||130.2||52||10.8|
or an average over the science programme of Gbyte per full night.
This last item needs closer inspection. Appendix D lists
the baseline set of parameters per detected object from CASU standard
pipeline processing. Assuming 4 bytes per parameter (they will be mainly
single precision floating point numbers) this is 66 parameters ×
4 bytes = 264 bytes per detection. Further, for the purposes of list-driven
co-located photometry (see the SRD; ie. given a detection in one passband,
what are the object parameters at fixed positions using fixed apertures
and profiles in all other passbands) this value should be scaled appropriately
for UKIDSS passbands (and ultimately another 5 SDSS passbands for
total generality; again, see the SRD). So, to order of magnitude, the
catalogue records size is bytes per detected object. Now, the
number of detected objects per frame will vary enormously. For example, in the
UKIDSS GPS, towards the Galactic centre the surface density of sources is
likely to be per sq. deg. (or objects per pixel)
while in the lowest surface density regions of the LAS this is likely to
drop to per sq. deg. (or objects per pixel). If
we assume a typical surface density of sources as being per
sq. deg., or objects per pixel, then for a given amount of
pixel data the object catalogue overhead is
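The arithmetic in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: the 66 parameters and 4 bytes per parameter come from the text, while the 0.4 arcsec native pixel scale, the 4 bytes per pixel, the simple multiplication by the number of passbands, and any surface density passed in are assumptions made for the example, not values taken from this section.

```python
# Sketch of the catalogue-overhead arithmetic. Function names are illustrative.

PARAMS_PER_DETECTION = 66    # parameter count from Appendix D
BYTES_PER_PARAM = 4          # single precision floats
PIXEL_SCALE_ARCSEC = 0.4     # ASSUMED native pixel scale
BYTES_PER_PIXEL = 4          # ASSUMED pixel size in bytes


def record_bytes(n_passbands=1):
    """Catalogue bytes per detected object, scaled linearly by passband count (assumption)."""
    return PARAMS_PER_DETECTION * BYTES_PER_PARAM * n_passbands


def objects_per_pixel(density_per_sq_deg, pixel_scale_arcsec=PIXEL_SCALE_ARCSEC):
    """Convert a surface density per square degree into objects per pixel."""
    pixels_per_sq_deg = (3600.0 / pixel_scale_arcsec) ** 2
    return density_per_sq_deg / pixels_per_sq_deg


def catalogue_overhead(density_per_sq_deg, n_passbands=1,
                       pixel_scale_arcsec=PIXEL_SCALE_ARCSEC,
                       bytes_per_pixel=BYTES_PER_PIXEL):
    """Catalogue bytes per byte of pixel data for a given source density."""
    per_pixel = record_bytes(n_passbands) * objects_per_pixel(
        density_per_sq_deg, pixel_scale_arcsec)
    return per_pixel / bytes_per_pixel


if __name__ == "__main__":
    # The density here is a placeholder input, not a survey value.
    print(record_bytes())
    print(catalogue_overhead(density_per_sq_deg=1.0e5, n_passbands=5))
```

The overhead scales linearly with both surface density and record size, so the order-of-magnitude spread in source density between surveys carries straight through to the catalogue volume.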
Some relevant details on individual parts of the data flow scheme above:
1,2,3: JAC/ATC responsibility; JAC will likely make offline tape archive of raw pixel data; couple of weeks online buffer storage will probably be used; ATC proposes 4-way parallelisation of the DAS & summit pipeline data chains; output format will be Starlink NDF (JAC archive) and MEFs (multi-extension FITS files) for transport to CASU.
4: For ease of housekeeping/transport/handling, one disk or tape per night would be advantageous. Peak data rate is 230 Gbyte/night. One or two disks may be employed to ship the data; alternatively tapes may be used. Currently, the highest capacity system would appear to be `Linear Tape Open' (LTO) which can store Gbyte native, and would probably manage a night's worth of data (with a bit of lossless compression) on all but the most productive of nights. The transfer rate for LTO is reportedly 100 Gbyte/hour. By the time WFCAM becomes operational, there may be higher capacity in this system, or higher capacity alternatives.
5: CASU pipeline will derive/add data to the images ingested from JAC: i) housekeeping info; ii) object catalogues; iii) confidence arrays. As stated in the assumptions above, a 10% increase on pixel data volume will allow for housekeeping, object catalogues and DBMS overheads; however iii) is potentially a large increase on raw pixel volume/rates. For example, if a 2-byte confidence value per 4-byte pixel is added for any image that is likely to be stacked (cf. current CIRSI pipeline processing) then volumes/rates increase by 50%. The greater fraction of the UKIDSS science program will be stacked to increase depth, so a conservative assumption would be to increase ALL pixel data volumes/rates by a factor 1.5; however it is unlikely that confidence values will be needed on a pixel-by-pixel basis; rather nightly library confidence frames will suffice. In this case the overhead will be small and can be subsumed into the existing 10% overhead.
7: An estimate of the yearly rate can be made as follows. Nights per year are likely to be 80% UK time on UKIRT , the fraction of all UK time given over to WFCAM. Assume 110 Gbytes per night average, and for the likely range assume . Then, the average yearly data accumulation rate will be between 19 and 26 Tbytes.
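The yearly estimate in item 7 can be reproduced with a short calculation. This is a sketch only: the nights and Tbyte totals are those tabulated for the five surveys, 1 Tbyte is taken as 1000 Gbyte, and the WFCAM fractions of 0.6 and 0.8 are assumptions chosen because they reproduce the quoted 19 and 26 Tbytes.

```python
# Sketch of the yearly accumulation estimate. Names and the 0.6/0.8 fractions are assumptions.

GBYTE_PER_TBYTE = 1000.0  # decimal units assumed

# survey: (nights, total Tbyte), as tabulated for the science programme
SURVEYS = {
    "LAS": (183.4, 29.3),
    "GPS": (130.2, 12.0),
    "GCS": (58.8, 9.4),
    "DXS": (123.9, 6.4),
    "UDS": (130.2, 10.8),
}


def average_gbyte_per_night(surveys=SURVEYS):
    """Programme-wide average: total tabulated volume over total nights."""
    total_nights = sum(nights for nights, _ in surveys.values())
    total_tbyte = sum(tbyte for _, tbyte in surveys.values())
    return total_tbyte * GBYTE_PER_TBYTE / total_nights


def yearly_tbyte(wfcam_fraction, gbyte_per_night=110.0,
                 uk_fraction=0.8, nights_per_year=365):
    """Yearly volume for a given fraction of UK time given over to WFCAM."""
    nights = nights_per_year * uk_fraction * wfcam_fraction
    return nights * gbyte_per_night / GBYTE_PER_TBYTE


if __name__ == "__main__":
    print(average_gbyte_per_night())
    print(yearly_tbyte(0.6), yearly_tbyte(0.8))
```

The tabulated totals average to roughly 108 Gbyte per night, consistent with the 110 Gbytes assumed in item 7.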
For science archiving, it is important to distinguish between storage requirements for `immediate' access (or `as fast as possible' access) and storage for less time critical usages. An example of the former is where an astronomer wishes to trawl object catalogues for rare objects, where data exploration (ie. interaction in real time) is important. An example of the latter is where a `power user' wishes to reprocess a large fraction of survey data to look for objects that they believe were missed in the standard pixel processing pipeline (eg. large-scale, low surface brightness objects). Broadly speaking, the split between usages requiring fast and slower (real-time/offline) response times is a split between catalogue usage and pixel data usage.
An estimate of the final pixel storage requirement for UKIDSS at least is straightforward: assuming 4 bytes per pixel and microstepping (ie. 0.2 arcsec pixels); the areas of the LAS, GPS, GCS, are respectively 4000 sq. deg. filters; ; (stacked pixel data for the DXS and UDS are negligible for these purposes). This adds up to Tbytes; the final UKIDSS object catalogues and associated data will be Tbytes.
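The pixel storage estimate follows from survey area, number of filters, pixel scale and bytes per pixel. The sketch below uses the 0.2 arcsec pixels and 4 bytes per pixel stated above. The function name is illustrative, and the number of filters passed in is a free input, not a value taken from this section.

```python
# Sketch of the stacked-pixel storage arithmetic; the filter count is a hypothetical input.

def pixel_storage_tbyte(area_sq_deg, n_filters,
                        pixel_scale_arcsec=0.2, bytes_per_pixel=4):
    """Storage in Tbyte (10**12 bytes) for full coverage of an area in n_filters."""
    pixels_per_sq_deg = (3600.0 / pixel_scale_arcsec) ** 2
    total_bytes = area_sq_deg * n_filters * pixels_per_sq_deg * bytes_per_pixel
    return total_bytes / 1.0e12


if __name__ == "__main__":
    # 4000 sq. deg. (the LAS area) in a single filter
    print(pixel_storage_tbyte(4000, 1))
```

At this pixel scale each square degree costs about 1.3 Gbyte per filter, so the 4000 sq. deg. of the LAS needs roughly 5.2 Tbyte per filter.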
At each point in the end-to-end system, the data flow volumes/rates and some hardware requirements can be roughly stated as follows:
6: If network, then need to be able to transfer 250/110 Gbyte/day (peak/average). Note: a 1 Gbit/s continuous link would enable 450 Gbyte/hr to be transferred.
8: User access: SSS is Gbyte/week; it is likely that the WFCAM archive will be 10x more, hence Gbytes/day.
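The link capacity quoted in item 6 is a unit conversion, sketched below together with the time needed to move a night of data. The function names and the `efficiency` parameter (the fraction of nominal bandwidth actually achieved) are illustrative assumptions and do not appear in the text.

```python
# Sketch of the network transfer arithmetic for item 6; names are illustrative.

def gbyte_per_hour(link_gbit_per_s):
    """Sustained throughput of a link in Gbyte per hour (8 bits per byte)."""
    return link_gbit_per_s / 8.0 * 3600.0


def transfer_hours(volume_gbyte, link_gbit_per_s, efficiency=1.0):
    """Hours needed to move a volume over a link at a given efficiency."""
    return volume_gbyte / (gbyte_per_hour(link_gbit_per_s) * efficiency)


if __name__ == "__main__":
    print(gbyte_per_hour(1.0))          # nominal capacity of a 1 Gbit/s link
    print(transfer_hours(250.0, 1.0))   # peak night
    print(transfer_hours(110.0, 1.0))   # average night
```

On these figures a continuous 1 Gbit/s link would move even a peak night in well under an hour, so achieved throughput can fall a long way below nominal before a daily transfer becomes impractical.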
In summary, data flow for the WSA will be:
The uncertainties above (eg. detected objects per pixel; the amount of confidence array information needed to be stored, etc.) should not prevent progress on hardware design and acquisition, since storage for the final data volume does not have to be purchased up front. Provided sufficient storage is acquired for the first year of operation, it will become clearer during that time what the precise long term requirements are. In any case, the lifetime of the WSA project is significantly longer than the typical timescale of leaps in computer hardware design, so it should be expected that the initial hardware solution will not be the final one, and a phased approach (as is required from the science exploitation point of view; see the SRD) is implied.
Numbers above are of course dominated by the volume of pixel data. If it is decided that it is unnecessary to archive processed pixels in uncompressed form, then storage volumes & data rates can be reduced dramatically. For example, if the archive contains 10x H-compressed pixels, then all numbers from 6 onwards can be reduced by 90%. However, there is a very clear requirement in the SRD for online archiving of unadulterated pixel data.
In specifying the requirements, a chance could be taken by assuming an average number of usable nights: eg. the UKIDSS proposal suggests that, on average, 70% of allocated nights will produce science data, so volumes/rates could be decreased by 30% (but note that on a daily basis the data flow system should still be able to cope with the peak data rates produced by, hopefully, many perfect nights).
APM/SuperCOSMOS/INT WFC/CIRSI analysis produces 32 4-byte parameters per detected object. This has been enhanced to include extra parameters for flux estimation and error estimates. The following is the suggested list for the standard WFCAM pipeline:
No. Name Description
1 Seq. no. Running number for ease of reference, in strict order of image detections
2 Isophotal flux Standard definition of summed flux within detection isophote, apart from
detection filter is used to define pixel connectivity and hence which
pixels to include. This helps to reduce edge effects for all isophotally
3 X coord Intensity-weighted isophotal centre-of-gravity in X
4 Error in X estimate of centroid error
5 Y coord Intensity-weighted isophotal centre-of-gravity in Y
6 Error in Y estimate of centroid error
7 Gaussian sigma These are derived from the three general intensity-weighted second moments.
8 Ellipticity The equivalence between them and a generalised elliptical Gaussian distribution
9 Position angle is used to derive Gaussian sigma =
Position angle = angle of ellipse major axis wrt x axis
10 Areal profile 1 Number of pixels above a series of threshold levels relative to local sky.
11 Areal profile 2 Levels are set at T, 2T, 4T, 8T ...128T where T is the threshold. These
12 Areal profile 3 can be thought of as a sort of poor man's radial profile. Note that for now
13 Areal profile 4 deblended, ie. overlapping images, only the first areal profile is computed
14 Areal profile 5 and the rest are set to -1 flagging the difficulty of computing accurate
15 Areal profile 6 profiles.
16 Areal profile 7
17 Areal profile 8
18 Peak height in counts relative to local value of sky - also zeroth order core flux
19 Error in pkht
20 Core flux Best used if a single number is required to represent the flux for ALL
objects. Basically aperture integration with radius rcore (in the FITS
header) but modified to simultaneously fit `cores' in case of overlapping
images. Best scaled to FWHM for site+instrument.
Combined with later-derived aperture corrections for general photometry.
21 Error in flux
22 Core 1 flux A series of different radii core/aperture measures similar to parameter 20
23 Error in flux
24 Core 2 flux Together with parameter 18 these give a simple curve-of-growth analysis from
25 Error in flux
26 Core 3 flux peak pixel, rcore, rcore, rcore, rcore, rcore,
27 Error in flux rcore, rcore, rcore, rcore, rcore,
28 Core 4 flux rcore, rcore
29 Error in flux
30 Core 5 flux basic core, ensures % of PSF flux
31 Error in flux
32 Core 6 flux Extras for generalised galaxy photometry further spaced
33 Error in flux
34 Core 7 flux by in radius to ensure correct sampling out to
35 Error in flux
36 Core 8 flux reasonable range of aperture sizes
37 Error in flux
38 Core 9 flux Note these are all corrected for pixels from overlapping neighbouring images
39 Error in flux
40 Core 10 flux
41 Error in flux
42 Core 12 flux Biggest would be rcore ie. 30 arcsec diameter
43 Error in flux
44 Petrosian radius as defined in Yasuda et al. 2001 AJ 122 1104
45 Kron radius as defined in Bertin and Arnouts A&A Supp 117 393
46 FWHM radius average image radius at half PeakHeight
47 Petrosian flux Flux within circular aperture to
48 Error in flux
49 Kron flux Flux within circular aperture to
50 Error in flux
51 FWHM flux Flux within circular aperture to - simple alternative
52 Error in flux
53 Error bit flag Bit pattern listing various processing error flags
54 Sky level Local interpolated sky level from background tracker
55 Sky variance Local estimate of variation in sky level around image
56 Child/parent Flag for parent or part of deblended deconstruct
The following are accreted directly after standard catalog generation
57 RA RA and Dec explicitly put in columns for overlay programs that cannot, in
58 Dec general, understand astrometric solution coefficients. Derived exactly from
WCS in header and XY in parameters 3 & 5
59 Classification Flag indicating probable classification: eg. -1 stellar, +1 non-stellar, 0 noise
60 Statistic An equivalent N(0,1) measure of how stellar-like an image is, used in
deriving parameter 59 in a `necessary but not sufficient' sense
From the further processing pipeline after deriving a suitable PSF
61 PSF flux Fitted flux from PSF
62 Error in flux
63 X coord Updated PSF-fitted X centroid
64 Error in coord
65 Y coord Updated PSF-fitted Y centroid
66 Error in coord
The translation was initiated by Nigel Hambly on 2003-01-08