The process of analysis and documentation has already started for the WFCAM project. Outline plans and consultation documents can be viewed via the UKIDSS web pages; these can be considered preliminary versions of some of what follows. We believe that the Unified Modelling Language (UML) is a good choice for top-down analysis and design of our science archive project, because the many complex interactions and activities involved are naturally expressed in such a language.
The first two tasks are currently being undertaken, resourced from our existing grant:
End-to-end top level design
Using the UML, we are taking a top level approach to modelling and
organising the WFCAM data analysis. Appendix A shows examples of three
types of UML diagrams illustrating possible standpoints from which to
model the system. The `package diagram' shown provides
a big-picture view of the data analysis frameworks undertaken at the
telescope site, CASU, and WFAU. The `use case' diagram depicts some of
the functionality and services offered at the end user science resource
at WFAU. The `sequence diagram' models the steps involved in
processing a query and request for some specialized data product
archived at WFAU. It is envisaged that this facility will be
implemented as a Web service, using the Simple Object Access Protocol
(SOAP) to transmit user queries and responses over the Internet.
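To make the sequence-diagram scenario more concrete, the fragment below sketches how an end-user tool might submit a query to such a SOAP service. It is a minimal illustration only: the endpoint URL, operation name and parameter names are assumptions, not a commitment to the final WFAU interface.

```python
# Minimal sketch of a SOAP 1.1 query client; the endpoint URL, operation name
# and parameter names are illustrative assumptions, not the final WFAU interface.
import urllib.request

WFAU_ENDPOINT = "http://archive.example/wfcam/query"   # hypothetical endpoint

def submit_query(query_text):
    """POST a query wrapped in a SOAP envelope and return the raw XML reply."""
    envelope = f"""<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <SubmitQuery xmlns="urn:wfcam-archive">  <!-- hypothetical operation -->
      <queryText>{query_text}</queryText>
    </SubmitQuery>
  </soap:Body>
</soap:Envelope>"""
    request = urllib.request.Request(
        WFAU_ENDPOINT,
        data=envelope.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": "urn:wfcam-archive#SubmitQuery"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")

if __name__ == "__main__":
    # Usage example against the hypothetical endpoint above.
    print(submit_query("SELECT TOP 10 * FROM wfcam_source"))
```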
Science Requirements Document (SRD)
This is the fundamental base for all subsequent analysis and work. It is
clearly important that all pipeline and archive work be science-driven.
The SRD will contain details of `use cases' of the science archive, illustrating
the kinds of studies envisaged to be undertaken with it.
Ideally, the `customers' of the final science data products should produce
the SRD using community-wide consultation. In practice, however, this is
unlikely to be easily achieved for WFCAM so we propose joint development
of the SRD between WFAU, CASU, UKIDSS and VISTA with due regard to the
existing VISTA SRD and AstroGrid `use cases'.
Because the SRD is so fundamental, a draft is
needed as soon as possible; in any case the SRD will be completed by the
end of '02 Q3.
The remaining tasks will be undertaken during the new grant period, and we detail our resource requirements in each case:
Definition of science data products (DDP)
This follows on directly from the SRD, and again will be jointly developed
with CASU/VISTA/UKIDSS. Science products to be formally defined
include pixel data, object catalogues (including source parameter
definitions), housekeeping `metadata', and merged catalogues (e.g.
multi-colour/multi-epoch datasets). The DDP will describe how output
from the archive will meet the requirements set out in the SRD.
The intention is that the DDP will contain enough detail for programmers
to take on the task of software design, coding and implementation
(see below); a purely illustrative sketch of one such product definition
follows this item.
We estimate 0.1 staff-years will be required for this task.
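As a purely illustrative sketch of the level of detail intended, a merged multi-colour catalogue entry might be formalised along the following lines. The field names, bands and units are assumptions for illustration, not the agreed WFCAM data model.

```python
# Illustrative sketch only: field names, bands and units are assumptions,
# not the agreed WFCAM data product definitions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MergedSource:
    """One row of a hypothetical merged multi-colour/multi-epoch catalogue."""
    source_id: int            # unique identifier assigned at merging
    ra_deg: float             # right ascension, J2000, degrees
    dec_deg: float            # declination, J2000, degrees
    mjd_mean: float           # mean epoch of contributing detections (MJD)
    j_mag: Optional[float]    # J-band magnitude, None if undetected
    h_mag: Optional[float]    # H-band magnitude
    k_mag: Optional[float]    # K-band magnitude
    class_flag: int           # morphological classification code
    quality_flags: int        # bit-mask of housekeeping/quality metadata

# Example record, with made-up values:
example = MergedSource(source_id=1, ra_deg=180.123456, dec_deg=0.654321,
                       mjd_mean=52500.5, j_mag=18.2, h_mag=17.6, k_mag=17.1,
                       class_flag=-1, quality_flags=0)
```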
Data flow document (DFD)
The data flow document will analyse the end-to-end data flow for WFCAM, from
the telescope to the end user. Although the DFD will be informed to a certain
extent by the SRD and DDP, it will be produced concurrently with them.
We estimate 0.2 staff-years will be required for this task.
Archive software architecture design
This will run in an experimental phase from '02 Q4, followed by a final
design phase once the hardware architecture has been fixed after delivery of the
SRD, DDP and DFD. It will consist of a top-level design for programmers,
including flow diagrams and UML analysis. Issues to be considered
include, amongst others, programming language choice (e.g. C/C++/Java/Perl),
web server protocols (e.g. XML, HTTP and/or FTP) and database management
system choice (e.g. object-oriented versus relational). Software analysis
will run from '02 Q4, concurrently with the formal top-level analysis for the
SRD, DDP and DFD. Again, some issues (e.g. web server protocols) can be
investigated and prototyped independently of the final data products
definition; a sketch of such a prototype follows this item. In this way,
coding effort will be evenly spread and we can take advantage of our existing
datasets and web services as a development test-bed.
We estimate 0.5 staff-years will be required for this task.
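As an example of the kind of stand-alone protocol prototyping intended, a throwaway HTTP/XML query service can be stood up against existing test datasets without waiting for the final data product definitions. The sketch below assumes hypothetical query parameters and response layout; it is not a proposed interface.

```python
# Minimal sketch of a stand-alone protocol prototype: a tiny HTTP service
# answering catalogue queries with XML. Paths, parameter names and the XML
# layout are assumptions for illustration only.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class QueryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        ra = params.get("ra", ["0.0"])[0]      # hypothetical query parameters
        dec = params.get("dec", ["0.0"])[0]
        body = (f"<result><query ra='{ra}' dec='{dec}'/>"
                f"<status>prototype only - no catalogue attached</status>"
                f"</result>").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Serve on localhost:8080; benchmark against existing test datasets.
    HTTPServer(("localhost", 8080), QueryHandler).serve_forever()
```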
Archive hardware architecture design
Hardware analysis
with existing equipment and archives at WFAU (e.g. experimental Beowulf;
SSS/Halpha datasets on RAID etc.) is currently being undertaken
to enable benchmarking
and prototyping for the WFCAM system. This work has started before delivery
of the SRD, DDP and DFD since some issues will not depend critically on the
exact content of the final science archive. Subsequently, final
hardware architecture design for the end-user science archive follows
naturally from the previous R&D phase and delivery of the
SRD, DDP and DFD. On completion of this we will reach
a critical milestone: hardware purchase. Issues to be considered in the
hardware design include fast online storage for survey pixels; mass
`nearline' storage for all science-worthy pixel data; fast engines for
speedy catalogue searching (e.g. we expect some kind of distributed
storage solution; an illustrative sketch follows below); fast analysis
machines (possibly some form of SMP unit); and finally network
configurations (e.g. fibre-linked Storage Area Network solutions versus
traditional Ethernet). This analysis will not be done in isolation: we
expect the WFCAM/VISTA archive hardware solution to be decided jointly with
JAC/CASU/VISTA and informed by the experience and studies of AstroGrid and
the ESO Next Generation Archive Systems Technologies project. We anticipate
sourcing other funding in addition to `e-science' displaced money, for
example industrial sponsorship in the form of loaned and/or donated equipment
(we are currently discussing the donation of an SMP machine with a major
hardware vendor). We reiterate that some of our existing hardware has been
financed via University and JREI initiatives.
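To illustrate why a distributed storage solution is attractive for catalogue searching, the sketch below fans a simple box-search predicate out over several catalogue partitions in parallel and merges the results. The partitioning scheme, record format and use of a local process pool (as a stand-in for Beowulf nodes) are assumptions for illustration, not a design decision.

```python
# Illustrative sketch only: partition layout, record format and the use of a
# local process pool (standing in for Beowulf nodes) are all assumptions.
from multiprocessing import Pool
import random

N_PARTITIONS = 8               # hypothetical number of catalogue partitions/nodes
SOURCES_PER_PARTITION = 100_000

def load_partition(part_id):
    """Stand-in for reading one partition's (ra, dec, mag) rows from local disk."""
    rng = random.Random(part_id)
    return [(rng.uniform(0, 360), rng.uniform(-90, 90), rng.uniform(10, 20))
            for _ in range(SOURCES_PER_PARTITION)]

def search_partition(args):
    """Apply a crude box search to one partition; runs on its 'own node'."""
    part_id, ra0, dec0, half_width = args
    rows = load_partition(part_id)
    return [r for r in rows
            if abs(r[0] - ra0) < half_width and abs(r[1] - dec0) < half_width]

if __name__ == "__main__":
    query = (180.0, 0.0, 0.5)   # hypothetical search centre and box half-width (deg)
    with Pool(N_PARTITIONS) as pool:
        partial_results = pool.map(search_partition,
                                   [(i, *query) for i in range(N_PARTITIONS)])
    matches = [row for part in partial_results for row in part]
    print(f"{len(matches)} sources returned from {N_PARTITIONS} partitions")
```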
An estimate of hardware costs is as follows: the maximum accumulation rate of science-worthy data is likely to be tens of Tbytes per year. As an indication of the likely scalable hardware solution for large science archives, we have costed a Beowulf cluster at £2K per CPU and £1K per hard disk. Presently, 180 Gbyte disks are the high-capacity standard, and 3 disks per CPU can easily be accommodated in current designs. A start-up system with 50 CPUs, each having one 180 Gbyte disk, would give a 9 Tbyte storage capability for a capital outlay of £150K. An expanded system (3 disks per CPU) would have a capacity of 27 Tbyte at a cost of £250K, including a 3 yr maintenance contract but excluding VAT; we have put this figure into the RG2 as a guideline for hardware costs, but expect to analyse the requirements fully and justify the resources at the end of 2003. With the advent of 400 Gbyte disks, such a system would be upgradable to 60 Tbyte. We do not currently envisage a need for offline tape backup at this stage, since JAC/ESO will keep tape archives (for WFCAM/VISTA respectively) and, in the case of WFCAM, CASU will have tape copies of raw data and disk copies of processed data.
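The capacity and cost figures above follow directly from the quoted per-unit prices; the short calculation below reproduces them (the expanded £250K figure matches the raw per-unit total, so the maintenance element is assumed to be bundled into those unit costs).

```python
# Worked version of the cost estimate above, using the quoted per-unit prices.
# The £250K expanded figure matches the raw per-unit total, so the 3 yr
# maintenance element is assumed to be bundled into the unit costs.
COST_PER_CPU_K = 2           # £K per CPU
COST_PER_DISK_K = 1          # £K per hard disk

def beowulf_estimate(n_cpus, disks_per_cpu, disk_gbyte):
    """Return (capacity in Tbyte, capital cost in £K) for a simple cluster."""
    n_disks = n_cpus * disks_per_cpu
    capacity_tbyte = n_disks * disk_gbyte / 1000.0   # decimal Tbyte, as in the text
    cost_k = n_cpus * COST_PER_CPU_K + n_disks * COST_PER_DISK_K
    return capacity_tbyte, cost_k

print(beowulf_estimate(50, 1, 180))     # start-up system: (9.0, 150)
print(beowulf_estimate(50, 3, 180))     # expanded system: (27.0, 250)
# With 400 Gbyte disks the same 150-disk system reaches 60 Tbyte
# (the cost of that later disk upgrade is not quoted above):
print(beowulf_estimate(50, 3, 400)[0])  # 60.0
```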
The above is summarised in items 1 to 8 of the Gantt chart. Critical review points/milestones are the delivery of the SRD (end '02 Q3) and the delivery of the hardware design (end '02 Q4), at which point the tendering process for hardware acquisition starts. Concurrent with the above is work on the SX database system and its implementation on a parallel processing system (this work has already started); this is itemised in the next section as a software deliverable. The following summarises the resource requirements, period and deliverable deadline for the design and analysis tasks:
Task | Resource (staff-years) | Period | Deliverable deadline
DDP | 0.1 | '02 Q4 | end Dec '02
DFD | 0.2 | '02 Q4 | end Dec '02
S/W architecture | 0.5 | '02 Q4 thru '03 Q1 | end Mar '03
H/W architecture | 0.3 | '02 Q4 | end Dec '02
Total | 1.1 | |
With reference to the Gantt chart in Appendix B, there are the external tasks of migrating SX to new DBMS software and the AstroGrid DBMS study, which we expect to inform our next decision milestone: fixing the DBMS architecture (end '03 Q2). Our plan then enters the coding (software) phase.