Archive scientist curation use cases for the WSA
Nigel Hambly Wide Field Astronomy Unit (WFAU), Institute for Astronomy, University of Edinburgh
Revision date: March 13, 2003
For the purposes of progressing database design in the WSA project, this document describes a set of `curation use cases'. The idea is to examine the WSA Science Requirements Document and the Usages of the WFCAM Science Archive to define the required curation procedures, and hence to outline broadly the data entities in the WSA and their upkeep. In this context, `curation' refers to the process of transferring and ingesting WFCAM data, generating new data products from the standard pipeline data products, and managing those data products in the Science Archive.
The curation use cases are split into four broad categories, both for reasons of clarity and because requirements on various timescales give rise to distinct tasks:
The idea is also to spread curation tasks in time as much as possible, so as to even out the processing load on the catalogue servers.
Because different survey data have different proprietary periods, it will be necessary to curate several databases with different access restrictions. We propose to curate online UKIDSS, open time and `world' (i.e. unrestricted access) databases, as well as incremental, offline versions of the same.
A database entity (table) will be required that keeps track of which curation tasks have been applied at any given time and to any given programme dataset.
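As an illustrative sketch only (not the actual WSA schema), such a tracking entity could look like the following; the table and column names are assumptions, and in-memory SQLite stands in for the real DBMS:

```python
import sqlite3

# Hypothetical curation-log entity; all names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CurationLog (
        programmeID INTEGER,   -- which programme dataset
        curationID  INTEGER,   -- which curation task (CU1..CU20)
        dateApplied TEXT,      -- when that task was last applied
        PRIMARY KEY (programmeID, curationID)
    )
""")
# Record that CU4 (detection ingest) has run for programme 1:
conn.execute("INSERT INTO CurationLog VALUES (1, 4, '2003-03-13')")
# Curation software can then ask which tasks a dataset has received:
done = [r[0] for r in conn.execute(
    "SELECT curationID FROM CurationLog WHERE programmeID = 1")]
```

A query against such a table also lets users check the provenance of any programme dataset.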
The following tasks occur on a daily basis. Pipeline processing will take place on a night-by-night basis and daily transfer and ingest is necessary to keep up with the end-to-end system data rate:
CU1: Obtain science data from CASU.
Pipeline-processed data will be transferred from the pipeline processing centre (CASU) via the internet. This use case consists of
Logged information from all curation tasks must be held in the database so that it is queryable by users. These logs will take the form of separate tables or, as in this case, logged information held with the image metadata.
CU2: Create `library' compressed image frame products
For speedy access to image data (eg. for finder chart purposes), compressed images are to be employed:
CU3: Ingest details of transferred and library compressed images into the archive DBMS, including their standard astrometric and photometric calibrations
Gbyte-sized FITS images will not be stored as BLOBs in the DBMS, as this would result in heavy I/O usage that would impact more time-critical catalogue DBMS queries. Images will instead be stored as flat files; however, their details (FITS header keywords, filenames, astrometric details, pipeline processing information, quality control information, etc.) will be stored in the DBMS so that these files can be tracked, from both curation and usage points of view.
CU4: Ingest single frame source detections into appropriate detection lists in the incremental (offline) archive DBMS.
All output from the standard pipeline source extraction algorithm needs to be stored in the archive.
The following use cases occur periodically. The idea here is that the WSA usages require production of data products (catalogues) within the DBMS: it will be impossible merely to append data to existing entities (tables), since creating a science-usable survey dataset requires many non-linear operations (e.g. pairing, indexing, astrometric and photometric recalibration). Exactly what timescale is practical will become clear with experience (e.g. weekly, fortnightly or monthly), but the order in which the following use cases occur is clearly important (e.g. spatial indexing must precede merging, which in turn must precede computation of proper motions, etc.).
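The ordering constraints described above can be captured as a dependency graph and resolved with a topological sort. The following sketch uses Kahn's algorithm; the particular dependency edges shown are illustrative, not a definitive statement of the WSA task graph:

```python
from collections import deque

def schedule(tasks, deps):
    """Order tasks so that every dependency runs first (Kahn's algorithm)."""
    indeg = {t: 0 for t in tasks}
    for before, after in deps:
        indeg[after] += 1
    queue = deque(t for t in tasks if indeg[t] == 0)
    order = []
    while queue:
        t = queue.popleft()
        order.append(t)
        for before, after in deps:
            if before == t:
                indeg[after] -= 1
                if indeg[after] == 0:
                    queue.append(after)
    return order

# Constraints from the text: CU6 (spatial indexing) precedes CU8 (merging),
# which precedes CU10 (proper motions); the remaining edges are assumed.
tasks = ["CU6", "CU7", "CU8", "CU9", "CU10"]
deps = [("CU6", "CU8"), ("CU8", "CU9"), ("CU8", "CU10"), ("CU7", "CU10")]
order = schedule(tasks, deps)
```

Whatever the eventual timescale, an automated scheduler of this kind guarantees the ordering constraints are honoured on every curation run.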
CU5: Create "library" H2-K difference image frame products
After a certain amount of image data has accumulated, it will be possible to update products resulting from combinatorial operations on individual image frames. This cannot be done on a daily basis when the constituent images are not guaranteed to be observed and processed together (generally the case).
CU6: Create spatial index attributes for all records having celestial co-ordinates
The simplest usages of the WSA (position/proximity searches), and also curation use cases such as pairing, will be made much more efficient if the database entities are spatially indexed in some way. In its simplest form, such indexing would consist of sorting on one co-ordinate; the WSA will use a more sophisticated approach, e.g. Hierarchical Triangular Mesh (HTM) indexing.
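For illustration only, here is a much-simplified stand-in for a scheme such as HTM: the sky is binned into equal-angle cells so that sorting records on the cell number clusters neighbouring sources together. The binning parameters are arbitrary assumptions:

```python
def sky_cell_index(ra_deg, dec_deg, n_dec=180, n_ra=360):
    """Map (ra, dec) to a coarse equal-angle cell number.

    A deliberately simplified stand-in for a real scheme such as HTM:
    it bins the sky into n_dec declination bands of n_ra cells each,
    so nearby sources usually share (or neighbour) a cell index.
    """
    i = min(int((dec_deg + 90.0) / 180.0 * n_dec), n_dec - 1)
    j = min(int((ra_deg % 360.0) / 360.0 * n_ra), n_ra - 1)
    return i * n_ra + j

# Two close sources land in the same cell; a distant one does not:
a = sky_cell_index(10.2, -5.3)
b = sky_cell_index(10.4, -5.1)
c = sky_cell_index(200.0, 40.0)
```

A proximity search then needs to examine only a handful of cells rather than the whole table; a real HTM index refines this idea hierarchically.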
CU7: Recalibrate photometry
Full-blown photometric solutions over one or more photometric nights within an observing block will be undertaken. This will consist of:
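The constituent steps are not enumerated here, but one basic ingredient of any such solution can be sketched: a least-squares photometric zero point relating instrumental magnitudes to standard-star catalogue magnitudes. The magnitudes below are invented purely for illustration:

```python
def fit_zero_point(instrumental, catalogue):
    """Least-squares zero point zp such that catalogue ~ instrumental + zp.

    For a pure offset model the least-squares solution is simply the
    mean residual; a full solution would also fit extinction and
    colour terms per night.
    """
    residuals = [c - i for i, c in zip(instrumental, catalogue)]
    return sum(residuals) / len(residuals)

# Invented standard-star measurements (instrumental vs catalogue mags):
inst = [14.93, 16.41, 15.07, 17.22]
cat  = [17.43, 18.92, 17.58, 19.71]
zp = fit_zero_point(inst, cat)
```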
CU8: Create/update merged source catalogues to the prescription available for a given survey from the appropriate detection list (CU4)
Standard CASU processing will not produce any merged catalogue products. Once the single passband detections are stored within the archive, a detection association algorithm will execute and produce merged multi-colour, multi-epoch records appropriate to the available data within a given survey dataset. Separate merged source catalogue products will be required for different UKIDSS sub-surveys and open time programmes. The `world' catalogues will be recreated at release time (CU19). Each merged catalogue dataset will have an associated table of logged information containing details of the merge run and what fields have been included to date.
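A hedged sketch of the core of such a detection-association step follows, assuming a simple nearest-neighbour match within a fixed tolerance; a production algorithm would treat spherical geometry, deblending and duplicate handling far more carefully:

```python
import math

def associate(primary, secondary, tol_arcsec=1.0):
    """Pair each primary detection with the nearest secondary detection
    within tol_arcsec; unmatched primaries get None (a default record)."""
    tol_deg = tol_arcsec / 3600.0
    merged = []
    for ra1, dec1 in primary:
        best, best_d = None, tol_deg
        for k, (ra2, dec2) in enumerate(secondary):
            # Small-angle flat-sky approximation, adequate at arcsec scales.
            d = math.hypot((ra1 - ra2) * math.cos(math.radians(dec1)),
                           dec1 - dec2)
            if d <= best_d:
                best, best_d = k, d
        merged.append((ra1, dec1, best))
    return merged

# Invented J-band and K-band detection lists (co-ordinates in degrees):
j_band = [(10.0000, -5.0000), (10.0100, -5.0100)]
k_band = [(10.00005, -5.00005), (11.0, -5.0)]
pairs = associate(j_band, k_band)
```

The first J-band detection pairs with the nearby K-band one; the second has no counterpart within tolerance and so would carry a default (null) K-band entry in the merged record.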
CU9: Produce list-driven measurements between WFCAM passbands
Standard source detection involves setting a detection threshold; however, in the context of data mining it may be important to have source extraction and detection limits (i.e. photometric measurements) at positions and with apertures/profiles/deblending defined, for example, by detections across all bands. This philosophy follows the SDSS, where flux measurements at standard positions and in standard apertures are made in all bands whenever a detection is present in at least one band.
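A minimal sketch of such a list-driven (`forced') measurement: flux is summed in a fixed aperture at a supplied position, regardless of whether that band yielded a detection. Real extraction would subtract background and use circular apertures or profile fits; the toy image is invented:

```python
def forced_photometry(image, x, y, radius=1):
    """Sum pixel values in a square aperture centred on (x, y).

    A deliberately minimal stand-in for real list-driven extraction:
    no background subtraction, circular apertures or deblending.
    """
    total = 0.0
    for j in range(y - radius, y + radius + 1):
        for i in range(x - radius, x + radius + 1):
            if 0 <= j < len(image) and 0 <= i < len(image[0]):
                total += image[j][i]
    return total

# Toy 10x10 image with a single bright pixel at (4, 4):
image = [[0.0] * 10 for _ in range(10)]
image[4][4] = 5.0
on_source  = forced_photometry(image, 4, 4)  # position from another band
off_source = forced_photometry(image, 8, 8)  # yields an upper limit
```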
CU10: Compute/update proper motions and other multi-passband derived quantities
Other multi-colour attributes include, for example, extinction and dereddened apparent magnitudes which have been suggested within the UKIDSS GPS.
Occasional tasks are associated with newly available, externally produced data products from other survey programmes. These must be held in the WSA for the purposes of joint querying, which enables many of the science goals of UKIDSS, for example. Astrometric recalibration will also be undertaken occasionally.
CU11: Recalibrate image/detection astrometry
After data have accumulated for a sufficient time, low-level systematic errors in astrometry may become apparent, and it may be possible to remove these; furthermore, new astrometric reference catalogues may become available over the lifetime of the WSA, in which case astrometric recalibration is in order.
CU12: Get publicly released and/or consortium-supplied (eg. complementary optical) external catalogues
It will be necessary to update the locally stored (but externally produced) survey products (eg. SDSS, 2MASS, etc.) where new releases of those products have been made; the UKIDSS consortium also has a programme of complementary optical imaging for the purposes of combined optical/IR science.
Proprietary PI programmes should be released to their proprietors as soon as possible; updates to UKIDSS surveys will naturally occur on a timescale dictated by blocks of WFCAM on-telescope periods.
CU13: Create "library" stacked and/or mosaiced image frame products
After accumulation of a certain amount of image data it will be possible to update products resulting from combinatorial operations on individual image frames.
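The simplest such combinatorial operation, a pixel-wise mean stack of registered frames, can be sketched as follows; registration, weighting and bad-pixel rejection are all omitted, and the pixel values are invented:

```python
def stack(frames):
    """Pixel-wise mean of equally sized, registered image frames."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]

# Two tiny 2x2 'frames' of invented pixel values:
coadd = stack([[[1.0, 2.0], [3.0, 4.0]],
               [[3.0, 4.0], [5.0, 6.0]]])
```

Stacking N frames improves the depth of the combined image, which is why deep-survey products such as the UDS stacks must be periodically recreated as data accumulate.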
CU14: Create standard source detection list from any new "library" frame product(s)
For example, if a new UDS stack has been made, then standard source detection should be run on the resulting image data. Curation use cases CU6 to CU9 would then apply:
CU15: Run periodic curation tasks CU6 through CU9
For any newly created stacked/mosaiced image product.
CU16: Create default joins with external catalogues
For the purposes of cross-IDs/neighbours for rapid cross-querying.
CU17: Produce list-driven measurements between WFCAM detections and non-WFCAM image data, where possible
Again, following CU9, and with usages such as U1 in mind. That example concerns the UKIDSS LAS and SDSS ugriz imaging data, where IR sources with upper limits in i and z are sought as candidate very cool objects.
CU18: Create/recreate table indices
Within a given table, indices will be created on combinations of commonly used attributes so that, at query time, the query optimizer will make use of these indices to greatly enhance performance.
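A minimal sketch of this in SQLite, with illustrative table and attribute names: a composite index lets the optimizer satisfy a range predicate without scanning the whole table, which can be verified from the query plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Source (srcID INTEGER, ra REAL, dec REAL)")
conn.executemany("INSERT INTO Source VALUES (?, ?, ?)",
                 [(i, i * 0.01, -5.0) for i in range(1000)])
# Composite index on commonly queried positional attributes:
conn.execute("CREATE INDEX ix_source_radec ON Source (ra, dec)")
# Ask SQLite how it would execute a typical range query:
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT srcID FROM Source "
    "WHERE ra BETWEEN 1.0 AND 1.05").fetchall()
used_index = any("ix_source_radec" in str(row) for row in plan)
```

Recreating such indices after bulk ingest, rather than maintaining them row-by-row during it, is generally much cheaper, which is why this appears as a distinct curation task.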
CU19: Verify, `freeze', and backup
Verification takes the form of examining the curation log to check that no further curation is needed for any given programme/survey dataset. This curation task will also create a `world'-readable subset of the given programme dataset, based on the proprietary period of the observations (tracked through the database) and the current release date. Prepared survey DBs should then be fixed and backed up for security.
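The world-readable subset test reduces to a per-observation date comparison. A sketch follows, assuming an illustrative 12-month proprietary period; the actual period varies per programme and is tracked in the database:

```python
from datetime import date, timedelta

def is_world_readable(obs_date, release_date, proprietary_days=365):
    """True once the proprietary period has elapsed at release time.

    The 365-day period is an assumption for illustration only; real
    periods differ per programme and are read from the database.
    """
    return obs_date + timedelta(days=proprietary_days) <= release_date

release = date(2003, 3, 13)
old_obs = date(2002, 1, 1)   # proprietary period elapsed
new_obs = date(2002, 12, 1)  # still proprietary at this release
```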
CU20: Release - place new DB products online
This task is the final step: any newly created database products will be placed on a publicly accessible catalogue server. At the same time, a `world'-readable `view' of the programme subsets will be created to present a single, logical database of all WFCAM observations having unrestricted access at that date.