The process of analysis and documentation has already started for the WFCAM project. Outline plans and consultation documents can be viewed via the UKIDSS web pages; these can be considered preliminary versions of some of what follows. We believe that the Unified Modelling Language (UML) is a good choice for top-down analysis and design of our science archive project, because the many complex interactions and activities involved are naturally expressed in such a language.
The first two tasks are currently being undertaken, resourced from our existing grant:
End-to-end top level design
Using the UML, we are taking a top level approach to modelling and
organising the WFCAM data analysis. Appendix A shows examples of three
types of UML diagrams illustrating possible standpoints from which to
model the system. The `package diagram' shown provides
a big-picture view of the data analysis frameworks undertaken at the
telescope site, CASU, and WFAU. The `use case' diagram depicts some of
the functionality and services offered at the end user science resource
at WFAU. The `sequence diagram' models the steps involved in
processing a query and request for some specialized data product
archived at WFAU. It is envisaged that this facility will be
implemented as a Web service, using the Simple Object Access Protocol
(SOAP) to transmit user queries and responses over the Internet.
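To make the sequence-diagram scenario more concrete, the fragment below sketches how an end-user tool might submit a query to such a SOAP service. It is a minimal illustration only: the endpoint URL, operation name and parameter names are assumptions, not a commitment to the final WFAU interface.

```python
# Minimal sketch of a SOAP 1.1 query client; the endpoint URL, operation name
# and parameter names are illustrative assumptions, not the final WFAU interface.
import urllib.request

WFAU_ENDPOINT = "http://archive.example/wfcam/query"   # hypothetical endpoint

def submit_query(query_text):
    """POST a query wrapped in a SOAP envelope and return the raw XML reply."""
    envelope = f"""<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <SubmitQuery xmlns="urn:wfcam-archive">  <!-- hypothetical operation -->
      <queryText>{query_text}</queryText>
    </SubmitQuery>
  </soap:Body>
</soap:Envelope>"""
    request = urllib.request.Request(
        WFAU_ENDPOINT,
        data=envelope.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": "urn:wfcam-archive#SubmitQuery"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")

if __name__ == "__main__":
    # Usage example against the hypothetical endpoint above.
    print(submit_query("SELECT TOP 10 * FROM wfcam_source"))
```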
Science Requirements Document (SRD)
This is the fundamental base for all subsequent analysis and work. It is
clearly important that all pipeline and archive work be science-driven.
The SRD will contain details of `use cases' of the science archive, illustrating
the kinds of studies envisaged to be undertaken with it.
Ideally, the `customers' of the final science data products should produce
the SRD using community-wide consultation. In practice, however, this is
unlikely to be easily achieved for WFCAM so we propose joint development
of the SRD between WFAU, CASU, UKIDSS and VISTA with due regard to the
existing VISTA SRD and AstroGrid `use cases'.
Because the SRD is so fundamental, a draft is
needed as soon as possible; in any case the SRD will be completed by the
end of '02 Q3.
The remaining tasks will be undertaken during the new grant period, and we detail our resource requirements in each case:
Definition of science data products (DDP)
This follows on directly from the SRD, and again will be jointly developed
with CASU/VISTA/UKIDSS. Science products to be formally defined
include pixel data, object catalogues (including source parameter
definitions), housekeeping `metadata', and merged catalogues (e.g.
multi-colour/multi-epoch datasets). The DDP will describe how output
from the archive will meet the requirements set out in the SRD.
The intention is that the DDP will contain enough detail for programmers
to take on the task of software design, coding and implementation
(see below); a purely illustrative sketch of one such product definition
follows this item.
We estimate 0.1 staff-years will be required for this task.
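As a purely illustrative sketch of the level of detail intended, a merged multi-colour catalogue entry might be formalised along the following lines. The field names, bands and units are assumptions for illustration, not the agreed WFCAM data model.

```python
# Illustrative sketch only: field names, bands and units are assumptions,
# not the agreed WFCAM data product definitions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MergedSource:
    """One row of a hypothetical merged multi-colour/multi-epoch catalogue."""
    source_id: int            # unique identifier assigned at merging
    ra_deg: float             # right ascension, J2000, degrees
    dec_deg: float            # declination, J2000, degrees
    mjd_mean: float           # mean epoch of contributing detections (MJD)
    j_mag: Optional[float]    # J-band magnitude, None if undetected
    h_mag: Optional[float]    # H-band magnitude
    k_mag: Optional[float]    # K-band magnitude
    class_flag: int           # morphological classification code
    quality_flags: int        # bit-mask of housekeeping/quality metadata

# Example record, with made-up values:
example = MergedSource(source_id=1, ra_deg=180.123456, dec_deg=0.654321,
                       mjd_mean=52500.5, j_mag=18.2, h_mag=17.6, k_mag=17.1,
                       class_flag=-1, quality_flags=0)
```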
Data flow document (DFD)
The data flow document will analyse the end-to-end data flow for WFCAM, from
the telescope to the end user. Although the DFD will be informed to a certain
extent by the SRD and DDP, it will be produced concurrently with them.
We estimate 0.2 staff-years will be required for this task.
Archive software architecture design
This will run in an experimental phase from '02 Q4, followed by a final
design phase once the hardware architecture has been fixed after delivery of the
SRD, DDP and DFD. It will consist of a top-level design for programmers,
including flow diagrams and UML analysis. Issues to be considered
include, amongst others, programming language choice (e.g. C/C++/Java/Perl),
web server protocols (e.g. XML, HTTP and/or FTP) and database management
system choice (e.g. object-oriented versus relational). Software analysis
will run from '02 Q4, concurrently with the formal top-level analysis for the
SRD, DDP and DFD. Again, some issues (e.g. web server protocols) can be
investigated and prototyped independently of the final data products
definition; a sketch of such a prototype follows this item. In this way,
coding effort will be evenly spread and we can take advantage of our existing
datasets and web services as a development test-bed.
We estimate 0.5 staff-years will be required for this task.
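As an example of the kind of stand-alone protocol prototyping intended, a throwaway HTTP/XML query service can be stood up against existing test datasets without waiting for the final data product definitions. The sketch below assumes hypothetical query parameters and response layout; it is not a proposed interface.

```python
# Minimal sketch of a stand-alone protocol prototype: a tiny HTTP service
# answering catalogue queries with XML. Paths, parameter names and the XML
# layout are assumptions for illustration only.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

class QueryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        params = parse_qs(urlparse(self.path).query)
        ra = params.get("ra", ["0.0"])[0]      # hypothetical query parameters
        dec = params.get("dec", ["0.0"])[0]
        body = (f"<result><query ra='{ra}' dec='{dec}'/>"
                f"<status>prototype only - no catalogue attached</status>"
                f"</result>").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Serve on localhost:8080; benchmark against existing test datasets.
    HTTPServer(("localhost", 8080), QueryHandler).serve_forever()
```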
Archive hardware architecture design
Hardware analysis
with existing equipment and archives at WFAU (e.g. experimental Beowulf;
SSS/Halpha datasets on RAID etc.) is currently being undertaken
to enable benchmarking
and prototyping for the WFCAM system. This work has started before delivery
of the SRD, DDP and DFD since some issues will not depend critically on the
exact content of the final science archive. Subsequently, final
hardware architecture design for the end-user science archive follows
naturally from the previous R&D phase and delivery of the
SRD, DDP and DFD. On completion of this we will reach
a critical milestone: hardware purchase. Issues to be considered in the
hardware design include fast online storage for survey pixels; mass
`nearline' storage for all science-worthy pixel data; fast engines for
speedy catalogue searching (e.g. we expect some kind of distributed
storage solution; an illustrative sketch follows below); fast analysis
machines (possibly some form of SMP unit); and finally network
configurations (e.g. fibre-linked Storage Area Network solutions versus
traditional Ethernet). This analysis will not be done in isolation: we
expect the WFCAM/VISTA archive hardware solution to be decided jointly with
JAC/CASU/VISTA and informed by the experience and studies of AstroGrid and
the ESO Next Generation Archive Systems Technologies project. We anticipate
sourcing other funding in addition to `e-science' displaced money, for
example industrial sponsorship in the form of loaned and/or donated equipment
(we are currently discussing the donation of an SMP machine with a major
hardware vendor). We reiterate that some of our existing hardware has been
financed via University and JREI initiatives.
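To illustrate why a distributed storage solution is attractive for catalogue searching, the sketch below fans a simple box-search predicate out over several catalogue partitions in parallel and merges the results. The partitioning scheme, record format and use of a local process pool (as a stand-in for Beowulf nodes) are assumptions for illustration, not a design decision.

```python
# Illustrative sketch only: partition layout, record format and the use of a
# local process pool (standing in for Beowulf nodes) are all assumptions.
from multiprocessing import Pool
import random

N_PARTITIONS = 8               # hypothetical number of catalogue partitions/nodes
SOURCES_PER_PARTITION = 100_000

def load_partition(part_id):
    """Stand-in for reading one partition's (ra, dec, mag) rows from local disk."""
    rng = random.Random(part_id)
    return [(rng.uniform(0, 360), rng.uniform(-90, 90), rng.uniform(10, 20))
            for _ in range(SOURCES_PER_PARTITION)]

def search_partition(args):
    """Apply a crude box search to one partition; runs on its 'own node'."""
    part_id, ra0, dec0, half_width = args
    rows = load_partition(part_id)
    return [r for r in rows
            if abs(r[0] - ra0) < half_width and abs(r[1] - dec0) < half_width]

if __name__ == "__main__":
    query = (180.0, 0.0, 0.5)   # hypothetical search centre and box half-width (deg)
    with Pool(N_PARTITIONS) as pool:
        partial_results = pool.map(search_partition,
                                   [(i, *query) for i in range(N_PARTITIONS)])
    matches = [row for part in partial_results for row in part]
    print(f"{len(matches)} sources returned from {N_PARTITIONS} partitions")
```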
An estimate of hardware costs is as follows: the maximum accumulation rate of science-worthy data is likely to be tens of Tbytes per year. As an indication of the likely scalable hardware solution for large science archives, we have costed a Beowulf cluster at £2K per CPU and £1K per hard disk. Presently, 180 Gbyte disks are the high-capacity standard, and 3 disks per CPU can easily be accommodated in current designs. A start-up system with 50 CPUs, each having one 180 Gbyte disk, would give a 9 Tbyte storage capability for a capital outlay of £150K. An expanded system (3 disks per CPU) would have a capacity of 27 Tbyte at a cost of £250K, including a 3 yr maintenance contract but excluding VAT; we have put this figure into the RG2 as a guideline for hardware costs, but expect to analyse the requirements fully and justify the resources at the end of 2003. With the advent of 400 Gbyte disks, such a system would be upgradable to 60 Tbyte. We do not currently envisage a need for offline tape backup at this stage, since JAC/ESO will keep tape archives (for WFCAM/VISTA respectively) and, in the case of WFCAM, CASU will have tape copies of raw data and disk copies of processed data.
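The capacity and cost figures above follow directly from the quoted per-unit prices; the short calculation below reproduces them (the expanded £250K figure matches the raw per-unit total, so the maintenance element is assumed to be bundled into those unit costs).

```python
# Worked version of the cost estimate above, using the quoted per-unit prices.
# The £250K expanded figure matches the raw per-unit total, so the 3 yr
# maintenance element is assumed to be bundled into the unit costs.
COST_PER_CPU_K = 2           # £K per CPU
COST_PER_DISK_K = 1          # £K per hard disk

def beowulf_estimate(n_cpus, disks_per_cpu, disk_gbyte):
    """Return (capacity in Tbyte, capital cost in £K) for a simple cluster."""
    n_disks = n_cpus * disks_per_cpu
    capacity_tbyte = n_disks * disk_gbyte / 1000.0   # decimal Tbyte, as in the text
    cost_k = n_cpus * COST_PER_CPU_K + n_disks * COST_PER_DISK_K
    return capacity_tbyte, cost_k

print(beowulf_estimate(50, 1, 180))     # start-up system: (9.0, 150)
print(beowulf_estimate(50, 3, 180))     # expanded system: (27.0, 250)
# With 400 Gbyte disks the same 150-disk system reaches 60 Tbyte
# (the cost of that later disk upgrade is not quoted above):
print(beowulf_estimate(50, 3, 400)[0])  # 60.0
```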
The above is summarised in items 1 to 8 of the Gantt chart. Critical review points/milestones are the delivery of the SRD (end '02 Q3) and the delivery of the hardware design (end '02 Q4), at which point the tendering process for hardware acquisition starts. Concurrent with the above is work on the SX database system and its implementation on a parallel processing system (this work has already started); this is itemised in the next section as a software deliverable. The following summarises the resource requirements, period and deliverable deadline for the design and analysis tasks:
Task | Resource (staff-years) | Period | Deliverable deadline
DDP | 0.1 | '02 Q4 | end Dec '02
DFD | 0.2 | '02 Q4 | end Dec '02
S/W architecture | 0.5 | '02 Q4 thru '03 Q1 | end Mar '03
H/W architecture | 0.3 | '02 Q4 | end Dec '02
Total | 1.1 | |
With reference to the Gantt chart in Appendix B, there are the external tasks of migrating SX to new DBMS software and the AstroGrid DBMS study, which we expect to inform our next decision milestone: fixing the DBMS architecture (end '03 Q2). Our plan then enters the coding (software) phase.