Aggregation Model for Mooring Data
- Basic Data Unit
o Time Series
Definition:
A sequence of points that are equally spaced in time with NO gaps (missing data uses a flag or fill value).
Uses:
start_time (yyyy-mm-dd hh:mm:ss) GMT or use FGDC conventions.
Number of points (time dimension)
Time_interval (delta-t) units (days, hours, minutes or seconds)
Variable dimension 1-Scalar, 2-Vector(x,y), 3-Vector(x,y,z)
Restriction – if there is more than one component, all components must use the same units (this would exclude speed and direction as a 2-D vector representation).
o Longitude and Latitude (x,y) position of mooring is fixed (time independent).
o Position on Mooring is defined by a depth variable (which may be a vector, and a function of time (points), e.g. depth(depth,time)).
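As an illustration of the time series definition above, the following is a minimal Python sketch (not part of the proposed model) of how the time of every point could be rebuilt from start_time, the number of points and the time interval, and of how missing samples could be replaced by a fill value so that the series has no gaps. The function names and the fill value of -9999.0 are assumptions made for this example only.

```python
from datetime import datetime, timedelta, timezone

# Seconds per allowed time_interval unit (days, hours, minutes or seconds).
UNIT_SECONDS = {"days": 86400, "hours": 3600, "minutes": 60, "seconds": 1}


def time_axis(start_time, number_of_points, time_interval, units):
    """Rebuild the time of every point from start_time, count and delta-t."""
    start = datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S")
    start = start.replace(tzinfo=timezone.utc)  # start_time is GMT
    step = timedelta(seconds=time_interval * UNIT_SECONDS[units])
    return [start + i * step for i in range(number_of_points)]


def fill_gaps(samples, axis, fill_value=-9999.0):
    """Put {time: value} samples on the regular axis; gaps get fill_value."""
    return [samples.get(t, fill_value) for t in axis]


# Four points, 30 minutes apart; the third sample is missing.
axis = time_axis("2003-12-24 00:00:00", 4, 30, "minutes")
samples = {axis[0]: 21.5, axis[1]: 21.4, axis[3]: 21.1}
values = fill_gaps(samples, axis)
```

Because the spacing is fixed, no separate time array has to be stored: the index of a point is enough to recover its time.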
First Ideas for Possible Data Models
A netcdf type structure could contain some of the following:
Dimensions cmpts,depth,time only apply to this variable V001 (unique name for variable at a given depth (range) on the mooring). Note: Fortran ordering is used for the independent variables; in DODS and netcdf CDL, the order would be reversed.
V001(cmpts, depth, time) : Vector Variable (unique name)
V001:long_name
V001:standard_name (COARDS CF names)
V001:units
V001:_FillValue (Flags)
V001:start_time (COARDS format)
V001:number_of_points
V001:time_interval
V001:time_interval_units
V001:nominal_depth(depth) (A level attribute to use as an ID)
V001:component_1_direction (East or 090 degrees_T) (Not used if cmpts=1)
V001:component_2_direction (North or 000 degrees_T)
V001_depth(depth,time) (time optional)
V001_depth:units (“m”)
V001_depth:positive (“down”)
V001_time(time) (COARDS time – not necessary – optional?)
V001_flags(cmpts,depth,time) (optional)
V001_flags:flag_values (Vector of QA/QC flags – CF)
V001_flags:flag_meanings (CF)
V001:instrument_ID
V001:instrument_description
V001:data_comment
V001:filters_applied
V001:institution
V001:ancillary_variables “V001_depth V001_time V001_flags” (CF)
Note: If multiple series are in a single file, single time, depth dimensions cannot be defined for the whole file. CF refers to “NetCDF Climate and Forecast Metadata Conventions” (http://www.cgd.ucar.edu/cms/eaton/cf-metadata/).
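The remark above about Fortran ordering versus DODS and netcdf CDL ordering can be checked with a small Python sketch. It assumes zero-based indexes and hypothetical dimension sizes, and only shows that V001(cmpts, depth, time) in Fortran order and V001(time, depth, cmpts) in CDL order address the same storage location.

```python
def fortran_offset(index, shape):
    """Flat offset in Fortran (column-major) order: first index varies fastest."""
    offset, stride = 0, 1
    for i, n in zip(index, shape):
        offset += i * stride
        stride *= n
    return offset


def c_offset(index, shape):
    """Flat offset in C / CDL (row-major) order: last index varies fastest."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


# Hypothetical sizes: 2 components, 3 depths, 5 times (zero-based indexes).
ncmpts, ndepth, ntime = 2, 3, 5
c, d, t = 1, 2, 4

as_fortran = fortran_offset((c, d, t), (ncmpts, ndepth, ntime))
as_cdl = c_offset((t, d, c), (ntime, ndepth, ncmpts))
```

Both calls give the same offset, so reversing the dimension order changes only the notation, not the layout of the data.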
An Alternate Way of Using Netcdf to Organize Multiple Time Series Arrays
The above organization suffers from having to find a unique name for each time series and having to define the metadata for each named variable. An alternate way is to use the old (Fortran) trick of using a single one-dimensional array to contain all the time series (assumed to be restricted to a single variable type, e.g. Temperature), using indexes to distinguish each time series. The metadata for each of the time series segments can then be arranged into arrays.
Global:
Variable_Type: - e.g. Temperature
Variable_Units: - e.g. degree_C
Time_Step_Units: - e.g. minutes
_FillValue:
Dimensions:
Number_of_Time_Series: M
Variable_Array_Size: Unlimited - nM
Number_of_Separate_Moorings:
Number_of_Distinct_Instruments:
Variables:
T(1 … n1, n1+1 … n2, n2+1 … n3, … , nM-1+1 … nM )
T_Flag(1 … … nM ) QC-Flag (Optional)
T_Depth(1 … … nM ) Only if Depths of measurements are time variable.
Metadata:
Latitude (1 … M )
Longitude (1 … M )
Water_Depth (1 … M )
Mooring_ID (1 … M ) – Links (Indexes) to Mooring Descriptions
Depth_Level (1 … M )
Instrument_ID (1 … M ) – Links (Indexes) to Instrument Descriptions
Start_Time (1 … M )
Stop_Time (1 … M )
Time_Step (1 … M )
Start_Index (1 … M ) = (1, n1+1, n2+1, … nM-1+1)
Number_of_Points (1 … M ) = (n1, n2-n1, … nM- nM-1 )
Comments (1 … M ) - Array of Fixed-Length Strings
Notes:
o This type of netcdf data model corresponds quite closely to the DODS sequence structure below.
o If the variable is a vector (e.g. current or wind), then the variable array can still be arranged the same way with the length increased to 2nM using a (u1,v1, u2,v2, … unM,vnM ) arrangement for 2-D vectors, for example, or separately named arrays can be used for the components.
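The single-array organization described above can be sketched in Python as follows. This is only an illustration: the function names pack and get_series are hypothetical, plain lists stand in for netcdf arrays, and the 1-based Start_Index of the text is kept and converted to Python's 0-based slicing inside get_series.

```python
def pack(series_list):
    """Concatenate M time series into one array T.

    Returns (T, Start_Index, Number_of_Points), where Start_Index is 1-based
    as in the text: (1, n1+1, n2+1, ...).
    """
    T, start_index, number_of_points = [], [], []
    for s in series_list:
        start_index.append(len(T) + 1)
        number_of_points.append(len(s))
        T.extend(s)
    return T, start_index, number_of_points


def get_series(T, start_index, number_of_points, k):
    """Return the k-th time series (k = 1 ... M) from the packed array."""
    first = start_index[k - 1] - 1  # convert 1-based index to 0-based
    return T[first:first + number_of_points[k - 1]]


# Three hypothetical temperature series of different lengths.
T, Start_Index, Number_of_Points = pack(
    [[10.0, 10.1, 10.2], [8.0, 8.5], [4.0]]
)
second = get_series(T, Start_Index, Number_of_Points, 2)
```

The per-series metadata arrays (Latitude, Depth_Level, Start_Time and so on) would be indexed with the same k.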
A DODS data model type structure could be as follows:
Dataset {
Sequence {
Latitude
Longitude
Water Depth
Variable Type (e.g. Temperature)
Units
_FillValue
Description
Sequence {
Nominal Depth
Start_Time
Stop_Time
Time_Step
Number of Points
Instrument ID
Comment
Sequence {
Time (may be a structure, e.g. year, month, day, hour, minute)
Variable (may be a vector)
(QC_Flag)
(Depth)
} time_series
} depth_level
} mooring
}
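To make the nesting of the DODS structure above more concrete, here is a Python sketch that mirrors the three sequence levels (mooring, depth_level, time_series) with dataclasses and flattens them into one row per sample. The class and field names are assumptions chosen to follow the outline, only a few of the listed fields are included, and this is not the DODS API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:  # innermost sequence: time_series
    time: float
    value: float


@dataclass
class DepthLevel:  # middle sequence: depth_level
    nominal_depth: float
    instrument_id: str
    samples: List[Sample] = field(default_factory=list)


@dataclass
class Mooring:  # outer sequence: mooring
    latitude: float
    longitude: float
    depth_levels: List[DepthLevel] = field(default_factory=list)


def flatten(moorings):
    """Emit one (lat, lon, nominal_depth, time, value) row per sample."""
    rows = []
    for m in moorings:
        for level in m.depth_levels:
            for s in level.samples:
                rows.append((m.latitude, m.longitude,
                             level.nominal_depth, s.time, s.value))
    return rows


# One hypothetical mooring with two depth levels.
mooring = Mooring(27.5, -88.0, [
    DepthLevel(100.0, "I1", [Sample(0.0, 21.5), Sample(1.0, 21.4)]),
    DepthLevel(500.0, "I2", [Sample(0.0, 9.8)]),
])
rows = flatten([mooring])
```

Flattening repeats the outer-level values (position, nominal depth) on every row, which is the kind of translation a user's array-oriented analysis software would need.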
A Relational Model for metadata might be organized as follows:
Mooring (Table)
Mooring_ID (Unique Key Value)
Latitude
Longitude
Water_Depth
Start_Date
Stop_Date
Description
DODS_Storage_Location (Relative URL)
Deployment (Table for Instrument Info versus deployment)
Mooring_ID (Link to Mooring Table)
Instrument_ID (Multi-key Value) [A Mooring may have multiple instruments, which can change between deployments.]
Deployment_Number (Multi-key Value) - 2 Keys together are Unique
Instrument_Code (Link to an Instrument Description Relation – not given)
Serial_ID (Manufacturer’s Serial Number or equivalent)
Start_Date
Stop_Date
Comments
Depth_Level (Table)
Instrument_ID (Link to Deployment Table – Unique Key Value)
Instrument_Depth
Profiling_Depths (Link to a depth_profile Table if ADCP or similar – not given)
Variable_Types (may be multiple columns of codes to give types of variable measured – codes linked to Variable Definition Table(s) – not given)
Comments
Time_Series (Table)
Filename (Unique Key Value – pointer to location of the actual time series data) [There are usually many different files associated with an Instrument, because of variables measured and degree of processing (e.g. filters).]
Instrument_ID (Link to Depth_Level Table)
Start_Time
Stop_Time
Number_of_Points
Time_Step
Filter_Code (possible link to a Filter Definition Table)
Variable_Codes (which Variable_Types are included from Depth_Level Table)
Variable_Units (see comment below)
Comments
Notes:
This is a fairly complex structure even without supporting definition tables. However, it attempts to follow relational database rules on normalized tables where there is a minimum of duplication of constant column values in the rows of each table. The structure is built for use with many different data types and levels of processing. For an NVODS aggregation data-model, it could probably be simplified (e.g. only allow 1 variable type for each Instrument_ID) because the tables could be tailored to the specific DODS data stream requested.
There are lots of different ways of structuring metadata, which in this context are the attributes that make a time-series array of values useful to the analysis software (e.g. the fields given above). How it is done is often (and should be) determined by how it is used. Choices made for the tables can restrict flexibility to accommodate different or new situations, instruments or data types.
At SAIC, we use a similar relational structure to the above. The time series are not stored in the relational database because of efficiency and storage requirements, but are stored in flat binary files referenced by the unique Filename in the time_series table. The metadata are not duplicated in the flat files; thus, the relational tables are the authoritative and unique source for essential information for the analysis programs.
Relational databases are used by WOCE to inventory their data (http://woce.nodc.noaa.gov/wdiu/) and would be a natural way to search time series metadata.
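A reduced version of the relational model above can be tried out with Python's built-in sqlite3 module. The sketch below is illustrative only: it keeps a subset of the columns, assumes one deployment per instrument, and uses invented mooring, instrument and file names. It shows how a metadata search (here, by latitude range) would return the filenames of the matching time series.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Mooring (Mooring_ID TEXT PRIMARY KEY,
                      Latitude REAL, Longitude REAL, Water_Depth REAL);
CREATE TABLE Deployment (Mooring_ID TEXT REFERENCES Mooring(Mooring_ID),
                         Instrument_ID TEXT, Deployment_Number INTEGER,
                         PRIMARY KEY (Instrument_ID, Deployment_Number));
CREATE TABLE Depth_Level (Instrument_ID TEXT PRIMARY KEY,
                          Instrument_Depth REAL);
CREATE TABLE Time_Series (Filename TEXT PRIMARY KEY,
                          Instrument_ID TEXT
                              REFERENCES Depth_Level(Instrument_ID),
                          Number_of_Points INTEGER, Time_Step REAL);
"""


def build_demo():
    """Create an in-memory database with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Mooring VALUES (?, ?, ?, ?)",
                     [("M1", 27.5, -88.0, 2000.0), ("M2", 40.0, -70.0, 300.0)])
    conn.executemany("INSERT INTO Deployment VALUES (?, ?, ?)",
                     [("M1", "I1", 1), ("M1", "I2", 1), ("M2", "I3", 1)])
    conn.executemany("INSERT INTO Depth_Level VALUES (?, ?)",
                     [("I1", 100.0), ("I2", 500.0), ("I3", 50.0)])
    conn.executemany("INSERT INTO Time_Series VALUES (?, ?, ?, ?)",
                     [("M1_I1_T.bin", "I1", 8760, 60.0),
                      ("M1_I2_T.bin", "I2", 8760, 60.0),
                      ("M2_I3_T.bin", "I3", 4380, 120.0)])
    return conn


def find_files(conn, lat_min, lat_max):
    """Return (Filename, Instrument_Depth) for moorings in a latitude range."""
    query = """
        SELECT ts.Filename, dl.Instrument_Depth
        FROM Time_Series ts
        JOIN Depth_Level dl ON dl.Instrument_ID = ts.Instrument_ID
        JOIN Deployment d ON d.Instrument_ID = dl.Instrument_ID
        JOIN Mooring m ON m.Mooring_ID = d.Mooring_ID
        WHERE m.Latitude BETWEEN ? AND ?
        ORDER BY dl.Instrument_Depth
    """
    return conn.execute(query, (lat_min, lat_max)).fetchall()


conn = build_demo()
files = find_files(conn, 25.0, 30.0)
```

The query returns only pointers to the flat files; the time series values themselves stay outside the database, as described for the SAIC arrangement.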
Proposals:
- Only one kind of variable (e.g. temperature or current velocity vector) is present in an aggregation file (if other variables are required – issue another request (search)).
- The retrieved variables are only processed to conform to standards and model structure requirements; processes such as time or depth averaging are not performed. A model with multiple variable instances (e.g. temperature at a number of different depths) may therefore have instances with different time steps and different start times relative to an hour (say). Data are not processed to conform to fixed sequences of time and/or depth; that should be an analysis procedure.
- A single aggregate structure (file) is preferred over each instance (time series) being returned as a separate file (but see discussion below).
Other Comments:
- The DODS data model does the mapping from array indexes to time and/or depth if the indexes are defined in the COARDS manner for netcdf files, e.g. time(time) and depth(depth). Thus, V(0,… ) is retrieved as V(300,360,420,… ) if time(0.. ) = (300,360,420,… ). This is okay if time is single valued. However, many sites use the EPIC conventions of time (Julian Day since year 0) and time2 (milliseconds after 00 hours GMT). In this case DODS uses time and ignores time2, so the independent variable is just the Julian Day, which for hourly data (say) would only change every 24 values. This problem may point to using the concept of an index, start time and time step to define the time of a measurement rather than defining time separately. If time is not defined as an array, V is retrieved just as an array V(0,… ).
- Implicit in some of the above is the use of the netcdf data model using arrays and attributes because of its widespread use for time series. Provider sites may use other data models, including hdf (which has an almost identical structure to netcdf for arrays), relational databases, and sequential records (such as old NODC card records). However, in a superficial survey of DODS sites providing data from moorings, all use netcdf as the file format to store their time series. The DODS data model will convert them to structured variables and sequences. Should the aggregation data model exploit the DODS data model’s ability to use complex structures and nested sequences (see above or the JGOFS model for hydrographic cast data) or be restricted by the limitations of the netcdf array structures? If a model with complex structures is adopted, the translation work of the aggregation server software could become much more complex, and if the user’s analysis software (e.g. MATLAB) expects netcdf-like arrays or vectors, then the user will require additional software to retranslate the more complex DODS sequences back into arrays. How practical should the aggregation data model structure be if users are to be encouraged to make use of the data?
- Widespread use of data in a common format (structure) provided by DODS aggregation servers may encourage the adoption of standards for processing software, which would be an improvement over the current chaos. Adoption of a theoretically appealing model may defeat this goal if the model differs too radically from the most commonly used data structures expected by time series analysis software.
- Most netcdf files of hydrographic data use a single file per cast, which leads to a huge number of files (e.g. WOCE Hydrographic Program http://whpo.ucsd.edu/ ). My attempt to provide multiple XBT and CTD casts from a single cruise in one structure (netcdf file) can be found at http://www.saicocean.com/. I do not know if this is useful to the user. Comparing the WOCE single cast netcdf file with SAIC’s cruise hydrographic file is perhaps an indication of the complexity added when data are aggregated. Is the increase in complexity worth the reduction in the number of files, which simplifies the data inventory and makes the data easier to organize? For example, if aggregate data are returned in a single file, it is already organized to some extent (i.e. by position, depth_level, etc.). If each individual time series of the aggregate is returned as a separate file, then the user has to do the sorting and organizing (a problem with many DODS sites). The use of cruise files versus cast files for hydrographic data, discussed above, illustrates this point. An alternative to complex aggregation files is to organize the individual components (single time series files) as entries in some appropriate relational database table and let the relational database organize the data. This would shift the complexity from the output files to the relational database tables.
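The time and time2 issue raised in the first comment above can be illustrated with a short Python sketch. It assumes, as stated there, that time is an integer day number and time2 is milliseconds after 00 hours GMT; the starting day number and the function names are hypothetical. It shows that the day number alone repeats 24 times for hourly data, and that either combining the two values or deriving time from an index, start time and time step gives a single-valued time.

```python
MSEC_PER_DAY = 86_400_000


def combined_time(time, time2):
    """Merge a day number and msec-of-day into one decimal-day value."""
    return time + time2 / MSEC_PER_DAY


def time_from_index(start_day, start_msec, step_msec, i):
    """Derive (day, msec) of point i from the start time and time step."""
    total = start_msec + i * step_msec
    return start_day + total // MSEC_PER_DAY, total % MSEC_PER_DAY


# 48 hourly points starting at 00 hours on a hypothetical day number.
pairs = [time_from_index(2452998, 0, 3_600_000, i) for i in range(48)]
days_only = [day for day, _ in pairs]            # what is left if time2 is ignored
decimal_days = [combined_time(day, msec) for day, msec in pairs]
```

Here days_only contains just two distinct values for 48 points, whereas decimal_days increases at every point.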
Version 2.0
December 24, 2003