Task Force on Seasonal Prediction - data handling strategy

Working Draft 1.0 - 09 Dec 2005

The TFSP meeting in Trieste (August 2005) requested that a working group should create a detailed proposal for data handling for the experimentation to be carried out for TFSP. The proposal is intended to follow the outline strategy discussed and provisionally agreed at the Trieste meeting. This document aims to develop the specifics of the proposed strategy.

A. Trieste strategy

After much discussion of the relative merits of a centralized versus distributed system of sharing data, it was proposed to try a hybrid solution. That is, standards will be set, and producing (or distributing) centres will be able to serve their own data so as to meet the specified standards. Alternatively, if a producing centre would prefer not to be responsible for serving its own data, it can pass the data to another centre which is willing to serve it. It is envisaged that there will be several centres which will be willing and able to serve other people’s data, at least from within a specified region.

The data are envisaged as being served in CF compliant netCDF, probably with an OPeNDAP (ie DODS) interface. Other data formats and data serving options are possible as optional extras, but the netCDF service is a mandatory minimum.
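
As a minimal sketch of what the mandatory service would mean for a data user, the following Python fragment reads from a served dataset over OPeNDAP. The URL and variable name are hypothetical, and the netCDF4 library must be built with OPeNDAP support:

# Read a served TFSP dataset over OPeNDAP; the URL is hypothetical.
from netCDF4 import Dataset

url = "http://data.example-centre.int/dods/tfsp/forecasts"
with Dataset(url) as ds:
    print(ds.ncattrs())              # CF global attributes of the dataset
    field = ds.variables["field"]    # hypothetical variable name
    subset = field[0, ...]           # slicing transfers only the requested part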

Issues that the working group need to consider are:

* The metadata content needed

* How the metadata will be specified in the netCDF

* Agreement on OPeNDAP as the initial web interface standard, and any issues arising from this.

* The data volumes to be expected, the extent to which they are feasible, and whether certain parts of the data might need to be made optional.

* Identifying sufficient capacity to serve the expected datasets

* Any recommendations on procedures to ensure correctness of served data

* How the strategy relates to other data strategies and projects

B. Working group proposals

0. Introduction

The following proposal takes account of established practice at operational centres, usual practice in the research community, and the protocols established at PCMDI for handling the IPCC data. It also takes account of detailed work undertaken in Europe on how to make ENSEMBLES data available in CF compliant netCDF. Additionally, we bear in mind what might be needed to (partially) harmonize the metadata with the structures being developed by the operational side of WMO for use eg in TIGGE.

Proposal: The TFSP data should be CF compliant netCDF data, with specified metadata content.

To complete the proposal, we need to specify the required metadata, and also give rules and guidance on how the metadata is to be encoded in netCDF, and how files ought to be structured for data exchange. These issues are dealt with in the following sections.

1. Metadata content

In the first instance, we assume that data to be exchanged are raw model output. If calibrated forecast products, anomalies, climatologies, verification scores etc are to be exchanged, then further metadata will be required. Note that although we will discuss below a particular representation of the metadata in netCDF, it is the metadata themselves which are the most fundamental part of this proposal. The representation of the metadata may change in the future, due either to new versions of netCDF and CF, or possibly even new data formats altogether, but the metadata should be relatively stable. The metadata discussed here are those needed to define a single model integration.

Requirement: the metadata must be machine readable, must properly distinguish different datasets in a way that enables the data to be archived, and must provide metadata useful for data searching. Metadata should also be useable for automatic plot labelling.

It is helpful to distinguish which metadata define the data, and which simply provide additional information. The latter could be used in database searches and for labelling purposes, but would not form part of any archive structure. These additional variables are listed below as comments. Some of the metadata are names in the form of strings. We may want to create different (linked) versions of these, for example one short fixed version (suitable for long term archival purposes) and one slightly longer more descriptive version.

Many of the metadata are logically independent, in the sense that specifying one does not fix the value of another. However, metadata can be linked. For example, we might provide both a long and a short name for an institution, or we might describe certain characteristics of a given experiment identifier. Such logical connections are noted below, since in some representations of the data (notably netCDF), they may affect how the metadata can or should be coded.

Defining metadata:

i. originating_centre: eg Met Office - centre with scientific responsibility for integrations (STRING, max length=6 and/or 16) (definition)

[It has been suggested that this should be coordinated with the work by WMO to define unique identifiers for producing centres, but initial discussions have not been promising. Perhaps we should have the definition being a unique, time invariant short string (length 6), and a separate metadata item such as centre_name, which would include a nice English language name which could be used for labelling etc, and which might change from time to time as institutes re-brand themselves.]

ii. experiment_identifier (STRING, max length=6 or 16). The originating centre is fundamentally responsible for assigning unique experiment identifiers for the different datasets it makes available, and should (ideally) provide documentation of each experiment. It is possible for common experiment identifiers to be agreed between different centres, if they are carrying out a common experiment. But there is no a priori guarantee that identical identifiers from different centres refer to scientifically equivalent experiments. (definition)

iii. forecast_system_version_number (assigned by originating centre; scientific details of the models used etc should be provided via a web link, INTEGER) (definition)

iv. forecast_method_number (default =1) (This distinguishes forecasts made with the same underlying model/forecasting system, but where variations have been introduced such that the different integrations have different properties, most importantly different climate drift. An example is the members of a perturbed parameter ensemble forecast. INTEGER) (definition)

v. ensemble_member_number (Different integrations made with the same model and forecasting system, which form a homogeneous and statistically indistinguishable ensemble. INTEGER) (definition)

Additional metadata:

i. original_distributor: eg ECMWF - centre with responsibility for operational or research distribution of data, ie the centre which first made the data publicly available, and to which queries about data integrity should be sent. (STRING, max length=16) (comment)

ii. production_status: operational, research, or a user defined <project_id>. “research” should be used for general research at a centre; project_ids should be used for specified international research projects. (STRING, max length=16) (comment, logically associated with experiment identifier)

iii. model_identifier (no default) (STRING, max length=16) (comment, logically associated with forecast_system_version_number)

iv. sst_specification (STRING, “coupled” or “observed” or “predicted” or “persisted anomaly” or “persisted absolute”, logically associated with experiment identifier)

v. real_time: “true” or “false”, according to whether the forecast was made in real time. Not an attribute of the experiment or the system_version, but of the individual forecast.

vi. archive_date: “YYYYMMDD” or “unknown”. When the data were produced, archived or published. The aim is to provide an approximate timestamp, to easily distinguish recent experiments from much older ones. Also, in the case that data need to be corrected in a globally distributed data system, the archive_date could be used to distinguish between the older, original data and the newer, corrected data. An attribute of the individual model integration.

An appropriate definition of “real time” will need to be given. A first proposal is “a seasonal forecast issued less than one calendar month after the nominal start date; or a short to medium range weather forecast issued less than 24 hours after the nominal start date”.
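
A sketch of how this first-proposal definition might be applied in code (Python; the function and its thresholds simply transcribe the proposal above):

import calendar
from datetime import datetime, timedelta

def add_one_calendar_month(d):
    # Advance by one calendar month, clamping the day of month
    # (eg 31 Jan -> 28 Feb).
    year, month = (d.year + 1, 1) if d.month == 12 else (d.year, d.month + 1)
    day = min(d.day, calendar.monthrange(year, month)[1])
    return d.replace(year=year, month=month, day=day)

def is_real_time(nominal_start, issued, forecast_type="seasonal"):
    # Apply the proposed definition of "real time" given in the text.
    if forecast_type == "seasonal":
        return issued < add_one_calendar_month(nominal_start)
    return issued - nominal_start < timedelta(hours=24)  # short/medium range

# Example: a forecast with nominal start 1 May, issued 20 May, is real time.
print(is_real_time(datetime(2005, 5, 1), datetime(2005, 5, 20)))  # True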

A single experiment from a single centre might include multiple models. Note also that origin/expver/system/method form a natural ‘tuplet’ which defines a particular homogeneous forecast, whose ensemble size is then spanned by ensemble_member_number. A ‘multi-model’ forecast consists of a collection of ‘tuplets’. Which elements of the tuplet vary between different members of a multi-model ensemble does not really matter for the processing of the forecast data. Data from each tuplet are treated as statistically separate; different ensemble members of a given tuplet are processed together.
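
A sketch of the grouping this processing rule implies, assuming each record carries the defining metadata above (the values shown are invented for illustration):

from collections import defaultdict

# One record per integration; all values here are hypothetical.
records = [
    {"origin": "ukmo", "expver": "tfsp01", "system": 2, "method": 1,
     "member": m}
    for m in range(10)
]

# Members sharing an origin/expver/system/method tuplet form one
# homogeneous ensemble; different tuplets are processed separately.
ensembles = defaultdict(list)
for r in records:
    tuplet = (r["origin"], r["expver"], r["system"], r["method"])
    ensembles[tuplet].append(r["member"])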

Although not needed for distribution and archive purposes, it is suggested that ‘comment’ metadata should be mandatory, since this will give a homogeneous dataset and aid future searching of the data.

The above metadata offer flexibility in describing different experiments, and are intended to allow fairly straightforward mapping from existing metadata practice in the global seasonal forecasting community. We strongly request feedback from producers of seasonal forecasts as to whether the above metadata are adequate.

2. Representation of metadata in CF compliant netCDF

CF compliant netCDF provides a language that can be used for describing the data content of a file. It does not provide a natural language for describing data independently of the file in which it is embedded. Further, it does not (as yet) provide a standard logical structure for describing the data in a given file. For example, a set of six fields, with specified attributes, could be described in several structurally different ways in CF compliant netCDF. In order to produce files from different groups which are homogeneous and consistent (and therefore amenable to straightforward common processing by software) it is necessary to give very detailed instructions on how the data should be written - the requirements for IPCC data are an example of this.

There is an argument that the CF convention should be tightened and/or extended to simplify this process. ECMWF and the ENSEMBLES project are considering proposing an extension to the CF convention which would remove these ambiguities for seasonal forecast data. How such a proposal might look is discussed below. Whether such a proposal will succeed and become part of the CF convention is not yet known, but comments on the ideas are invited.

CF compliant netCDF mandates or recommends the following global attributes, which are designed to document the overall nature of the data:

Conventions “CF-1.0”

Title

Institution

Source

History

References

Comment

The above fields are often filled in as lengthy, human-readable strings, sometimes with multiple pieces of information under one heading. TFSP recommends that these fields are filled in following existing best practice, in a way that makes the source and nature of the data clear to the human reader. These “human readable” metadata are intended purely for human consumption, and are not useful for categorizing the data, since they are unstructured and will be filled in differently by different groups.

Example:

Title: Meteo-France seasonal forecast data

Institution: “Model run by Meteo-France. Data processed by ECMWF. Data distributed by ECMWF.”

Source: “Data generated by Arpege model, run by Meteo-France at ECMWF.”

History:

References: “http://www.ecmwf.int/products/forecasts/seasonal/documentation.html”

Comment: “Part of EUROSIP multi-model forecast system. Use of data subject to EUROSIP data policy - see web link for details”

Ideally the above would contain more specific information, such as version numbers, system numbers, resolution etc. However, since data are normally generated automatically by computer programs, it is difficult to include much detail in free-flowing text of this sort without the risk that it becomes inaccurate when details change. It is better to be vague than to be wrong.

TFSP recommends providing a web link which gives access to a full description of the data, the meaning of experiment identifiers etc, and details of data policy if required. The use of a web link is much more appropriate than trying to include large amounts of detail in the netCDF file itself, and also allows relevant information to be kept up to date. The web link should be given in the global attributes of the file, eg under References as in the example above.

Since we are recommending a specific schema or layout for data, it may help if conformance to this is indicated by a global attribute, particularly in the case that the meaning of the file structure is not tightly definable by CF compliant standard names.

Thus we propose the global attribute

:schema = “TFSP-1.0”

This also allows version control of the data layout specification. The string could be WCRP instead of TFSP, if the JSC would like to adopt our standard for wider use.
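
Processing software could then test this attribute before assuming the layout described below; a minimal sketch in Python (the file name is hypothetical):

from netCDF4 import Dataset

with Dataset("forecast.nc") as ds:
    # Global attributes are exposed as Python attributes of the Dataset.
    schema = getattr(ds, "schema", None)
    if schema != "TFSP-1.0":
        raise ValueError("not a TFSP-1.0 file: schema=%r" % schema)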

We now describe how the machine-readable metadata should be encoded in CF compliant netCDF. TFSP recommends that any strings used remain as standardized as possible, ie changes to case, spacing and abbreviations should be avoided. Over a long period of time, institute names etc are likely to change (eg the past change from NMC to NCEP), and it will be necessary to provide appropriate documentation of this to aid data searching.

Outline of proposed CF-compliant netCDF data layout:

dimensions:
    latitude = 180 ;
    longitude = 360 ;
    level = 10 ;
    time = 184 ;
    initial_time = 20 ;
    forecast_number = 5 ;
    ensemble_member_number = 10 ;
    string_max = 16 ;

variables:
    float latitude(latitude) ;
    float longitude(longitude) ;
    float level(level) ;
    double time(time) ;
        time:units = "days" ;
        time:standard_name = "time" ;
        time:long_name = "time elapsed since the start of the forecast" ;
    double initial_time(initial_time) ;
        initial_time:units = "days since 1900-01-01 00:00:00.0" ;
        initial_time:standard_name = "forecast_reference_time" ;
    int forecast_number(forecast_number) ;
    char originating_centre(forecast_number, string_max) ;         (A)
    char experiment_identifier(forecast_number, string_max) ;      (A)
    int forecast_system_version_number(forecast_number) ;          (A)
    int forecast_method_number(forecast_number) ;                  (A)
    char production_status(forecast_number, string_max) ;          (B)
    char model_identifier(forecast_number, string_max) ;           (B)
    char original_distributor(forecast_number, string_max) ;       (B)
    char sst_specification(forecast_number, string_max) ;          (B)
    int ensemble_member_number(ensemble_member_number) ;
    float field(forecast_number, initial_time, ensemble_member_number,
                time, level, latitude, longitude) ;
    char real_time(forecast_number, initial_time,
                   ensemble_member_number, 1) ;          (T or F) (C)
    char archive_date(forecast_number, initial_time,
                      ensemble_member_number, string_max) ;        (C)

Here we have chosen to code all of the metadata in the form of variables rather than either global attributes (which are a file-based concept, and would restrict which data could be served in a single file) or attributes of variables (which only works if the attribute has a single value for all the relevant data in the file). The choice to use variables fits with the philosophy of CF, and allows more flexibility when used with appropriate applications, but can make datasets a little more awkward to use with applications that do not handle multi-dimensional datasets well.
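
To make the layout and the variables-not-attributes choice concrete, the following is a sketch of how such a file might be written with the Python netCDF4 library. Names follow the layout above; the file name and metadata values are invented for illustration, and only a subset of the variables is created:

import numpy as np
from netCDF4 import Dataset, stringtochar

ds = Dataset("tfsp_example.nc", "w")
ds.Conventions = "CF-1.0"
ds.schema = "TFSP-1.0"

# Dimensions as in the proposed layout.
for name, size in [("latitude", 180), ("longitude", 360), ("level", 10),
                   ("time", 184), ("initial_time", 20),
                   ("forecast_number", 5), ("ensemble_member_number", 10),
                   ("string_max", 16)]:
    ds.createDimension(name, size)

time = ds.createVariable("time", "f8", ("time",))
time.units = "days"
time.standard_name = "time"

itime = ds.createVariable("initial_time", "f8", ("initial_time",))
itime.units = "days since 1900-01-01 00:00:00.0"
itime.standard_name = "forecast_reference_time"

# Metadata coded as variables spanning forecast_number, not as global
# attributes, so that one file can hold several tuplets.
centre = ds.createVariable("originating_centre", "S1",
                           ("forecast_number", "string_max"))
centre[:] = stringtochar(np.array(["ukmo"] * 5, dtype="S16"))  # hypothetical

ds.createVariable("field", "f4",
                  ("forecast_number", "initial_time",
                   "ensemble_member_number", "time", "level",
                   "latitude", "longitude"))
ds.close()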