1 Data processor

In many cases, data coming directly from an ESSE data source need to be modified before use in the fuzzy search or visualization engine. For example, the NCEP Reanalysis database provides eastward and northward wind speeds (the U- and V-wind parameters), whereas the absolute wind speed and wind direction are needed. Another example is the frequent task of converting temperature units between kelvins, degrees Celsius, and degrees Fahrenheit. Applications may also need the diurnal temperature variance, the minimum and maximum temperatures over some time period, and so on. Thus we need a special component that can be programmed to calculate a new time series from several inputs. We call this component a data processor. At the moment, our data processor can be programmed with expressions whose operands have different time stamps but the same spatial coordinates. In other words, we can process several time series along the time axis; in the future we plan to extend the data processor with spatial functions as well.
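The derived parameters mentioned above are simple point-wise formulas. The following is a minimal Java sketch of them, not code from the ESSE data processor: the class and method names are hypothetical, and the wind direction convention (the direction the wind blows from, in degrees clockwise from north) is our assumption, since the text does not fix one.

```java
// Illustrative sketch only: names are hypothetical and not part of ESSE.
final class DerivedParameters {

    /** Absolute wind speed from the eastward (u) and northward (v) components. */
    static double windSpeed(double u, double v) {
        return Math.hypot(u, v);
    }

    /**
     * Direction the wind blows FROM, in degrees clockwise from north.
     * This meteorological convention is an assumption of the sketch.
     */
    static double windDirectionFrom(double u, double v) {
        double degrees = Math.toDegrees(Math.atan2(-u, -v));
        return (degrees + 360.0) % 360.0; // map (-180, 180] to [0, 360)
    }

    /** Kelvin to degrees Celsius. */
    static double kelvinToCelsius(double kelvin) {
        return kelvin - 273.15;
    }

    /** Degrees Celsius to degrees Fahrenheit. */
    static double celsiusToFahrenheit(double celsius) {
        return celsius * 9.0 / 5.0 + 32.0;
    }
}
```

In the data processor itself such formulas are not hard-coded; they are expressed as calculation trees built from the instructions listed below.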

As an input the data processor receives one or several time series and a set of instructions (program) to process the data. The set of instructions can include:

  • arithmetic expressions
  • elementary functions
  • moving average
  • seasonal variations
  • time shifts
  • elementary spatial operations

The complete list of instructions supported by our data processor is presented in Appendix 10.4.

1.1 Data model

To exchange data between components of the data processor we use an abstraction called the ESSE data model. A UML diagram of the data model is shown in Fig. 26.

Figure 26. UML class diagram of the abstract ESSE data model

After analyzing the requirements for the data processor, we identified the following components needed in the abstract data model:

Dimensions:

points = pointNumber // spatial dimension

vector = intervalNumber // number of intervals

data = dataLength // number of observations at each spatial point

Variables:

float lat (points)

float lon (points)

float data (points, data)

long time (vector) // (in the future this variable will be defined on the data dimension)

int interval (vector)

In fact, the abstract ESSE data model is one of the three data models used by the data processor. The other two are EssePack and its XML serialization. EssePack is a lightweight data model used for data exchange between different ESSE modules (see the Component interplay section below). EssePack has the following structure:

private String dayId; // starting date (yyyy-mm-ddTHH:mm:ssZ)

private float[][] data; // data for given grid points ([grid point number][sample number])

private float[][] grid; // grid array (e.g. lat/lon coordinates of grid points: [grid point number][0 or 1 or 2])

private int sampling; // data sampling (in seconds) for constant sampling

private int[][] times; // time shift (in seconds) since dayId; null for constant sampling

private String comment; // used for general purpose

private String location; // used for general purpose (station ids)

private String[] stationCredentials;

The EssePack structure is appropriate for simple data exchange. It is used as the standard output format by all ESSE data sources. The binary input and output streams of our OGSA-DAI activities are also formatted as EssePacks.
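To illustrate how the sampling and times fields complement each other, here is a minimal sketch that resolves the time stamp of a single sample. It is not part of the ESSE code base: the helper class is hypothetical, it assumes that times is indexed like data ([grid point number][sample number]), and it assumes that dayId is an ISO-8601 instant with a literal Z suffix, as the yyyy-mm-ddTHH:mm:ssZ pattern suggests.

```java
import java.time.Instant;

// Hypothetical helper, not part of ESSE: works on EssePack-like field values.
final class EssePackTimes {

    /** Offset of one sample in seconds since dayId. */
    static long offsetSeconds(int sampling, int[][] times, int point, int sample) {
        if (times == null) {
            // Constant sampling: samples are evenly spaced.
            return (long) sampling * sample;
        }
        // Irregular sampling: explicit shift per sample (index order is assumed).
        return times[point][sample];
    }

    /** Absolute time stamp of one sample. */
    static Instant sampleTime(String dayId, int sampling, int[][] times,
                              int point, int sample) {
        return Instant.parse(dayId)
                .plusSeconds(offsetSeconds(sampling, times, point, sample));
    }
}
```

For a pack starting at 2005-01-01T00:00:00Z with a constant sampling of 21600 seconds, the sample with index 1 falls at 06:00 on the same day.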

XML is another widely accepted format for data exchange. We have developed an XML schema to serialize binary data in EssePack format to XML and to de-serialize it back. The schema uses alternating <v> and <t> tags for the data and time values, preceded by a metadata header at the beginning of the XML document. An example XML data document:

<DataSet>

<DataParameter>

<grid>

<point>55.0 37.5</point>

</grid>

<functionalKey>

<!--short data processing history-->

</functionalKey>

<v>1.3297071 </v><t>2005-01-01T00:00:00UTC</t>

<v>1.0548935 </v><t>2005-01-01T06:00:00UTC</t>

<!--and so on-->

</DataParameter>

</DataSet>
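As an illustration of the de-serialization step, the sketch below reads the <v> and <t> elements of such a document with the standard JDK DOM parser. It is only a sketch under assumptions: the class name is hypothetical, no schema validation is performed, and the metadata header is ignored.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader, not part of ESSE.
final class DataSetReader {

    /** Trimmed text content of all elements with the given tag, in document order. */
    static List<String> texts(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tag);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(nodes.item(i).getTextContent().trim());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse data document", e);
        }
    }

    /** Data values from the <v> elements. */
    static List<Float> readValues(String xml) {
        List<Float> values = new ArrayList<>();
        for (String s : texts(xml, "v")) {
            values.add(Float.parseFloat(s));
        }
        return values;
    }

    /** Time stamps from the <t> elements, kept as text. */
    static List<String> readTimes(String xml) {
        return texts(xml, "t");
    }
}
```

The time stamps are returned as strings because the example uses a UTC suffix rather than a standard ISO-8601 zone designator, so we do not guess a parsing pattern here.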

The serialization and de-serialization between XML and binary formats takes some time for large data files. In addition, the representation of numeric values as text is not optimal for file sizes and requires additional compression (e.g. GZIP). That is why we developed a special abstract data model to be used inside the data processor module.

1.2 Methods

The next step after the data model was chosen is to design the architecture and logic of the data processing kernel. Here we use the same approach as in the fuzzy search engine: we represent the data processing expression as an XML-formatted calculation tree, which can be parsed into a DOM tree by any standard XML parser. Details on the syntax of the data processing expression are given in Appendix 10.4.

The main difference between the calculation trees of the data processor and those of the fuzzy search engine is the number of functions. In addition, the calculation tree schema must be easily extensible with new functions and operations. Writing a new data processor module for every new function would be very inefficient, so we have defined several classes of functions with similar data processing algorithms. One such class is the set of functions that process the time series step by step, for example to find the minimum, maximum, or average of the observations from every location at the same time.

Every function class in the data processor has its own handler. In a loop, the handler visits its arguments, sequentially extracts time intervals, and for each time interval performs the required processing step by calling a function from another package called the mini-processor (for example, a call to calculate the minimum of float arguments). All data handlers are located in the package ru.wdcb.esse.dataprocessor.EsseDataFunctions. The mini-processor functions are located in the package ru.wdcb.esse.dataprocessor.SimpleFunctions. In case of a calculation error, a DataProcessorException is thrown. The iterator over the XML calculation tree is located in DataProcessActivity.
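The division of labor between a handler and the mini-processor can be sketched as follows. This is a simplified illustration, not the actual ESSE classes: the names are hypothetical, a time interval is modeled as a fixed number of consecutive samples of one series, and a standard IllegalArgumentException stands in for DataProcessorException.

```java
import java.util.Arrays;

// Hypothetical sketch of the handler / mini-processor split, not ESSE code.
final class IntervalHandlerSketch {

    /** Stand-in for a mini-processor function: reduces one interval to one value. */
    interface MiniFunction {
        float apply(float[] interval);
    }

    static final MiniFunction MIN = a -> {
        float m = a[0];
        for (float x : a) m = Math.min(m, x);
        return m;
    };

    static final MiniFunction MAX = a -> {
        float m = a[0];
        for (float x : a) m = Math.max(m, x);
        return m;
    };

    static final MiniFunction AVERAGE = a -> {
        double sum = 0.0;
        for (float x : a) sum += x;
        return (float) (sum / a.length);
    };

    /** Handler loop: extracts consecutive intervals and delegates each to fn. */
    static float[] process(float[] series, int samplesPerInterval, MiniFunction fn) {
        if (samplesPerInterval <= 0) {
            throw new IllegalArgumentException("interval length must be positive");
        }
        int count = (series.length + samplesPerInterval - 1) / samplesPerInterval;
        float[] out = new float[count];
        for (int i = 0; i < count; i++) {
            int from = i * samplesPerInterval;
            int to = Math.min(from + samplesPerInterval, series.length);
            out[i] = fn.apply(Arrays.copyOfRange(series, from, to));
        }
        return out;
    }
}
```

With this split, supporting a new step-by-step function only requires a new MiniFunction; the interval loop in the handler stays unchanged, which is the motivation for grouping functions into classes.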

1.3 Architecture

The first package, ru.wdcb.esse.cdm, is the implementation of the EsseDataModel. It includes classes that implement the entities of the abstract ESSE data model, plus several helpers: EsseData is the root data storage; Dimension, Attribute, Variable, ArrayVariable, and RangeVariable are the entities of the abstract data model; DataType is the type descriptor. We have not implemented the VirtualVariable. This package, as well as the common data model itself, is still in its “alpha” version. The second package, ru.wdcb.esse.dataprocessor, was described in the previous section.

The data processing is done as follows. The DataProcessingActivity receives as input an XML calculation tree and the output data format. The XML parser creates a DOM tree from the XML input and validates it against the XSD schema. The DataProcessingActivity iterates over the DOM tree and calculates the function values in its nodes. The terminal leaves of the DOM tree are raw data requests to the activity inputs. The input data is loaded into the EsseData object, which implements the abstract data model. When the calculations are finished, the result from the new EsseData object is converted to NcML XML format.
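The tree traversal described above can be sketched as a recursive walk over the DOM tree. The sketch below is deliberately simplified and is not the DataProcessingActivity itself: the class name is hypothetical, each input is a single number instead of a time series, XSD validation and output formatting are omitted, and the semantics of the few supported functions (sqrt, sum, sub, add, power with an additionalArgument attribute) are our assumptions. The authoritative definitions are those of Appendix 10.4.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical scalar evaluator for a calculation tree, not ESSE code.
final class CalcTreeSketch {

    static double evaluate(String xml, Map<String, Double> inputs) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse calculation tree", e);
        }
        // The first <function> child of the root is the top of the tree.
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && n.getNodeName().equals("function")) {
                return eval((Element) n, inputs);
            }
        }
        throw new IllegalArgumentException("no <function> under the root element");
    }

    private static double eval(Element node, Map<String, Double> inputs) {
        if (node.getNodeName().equals("input")) { // terminal leaf: raw data request
            Double value = inputs.get(node.getAttribute("name"));
            if (value == null) {
                throw new IllegalArgumentException("unknown input: " + node.getAttribute("name"));
            }
            return value;
        }
        List<Double> args = new ArrayList<>();
        Double extra = null; // value of the additionalArgument attribute, if present
        for (Node n = node.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element)) continue;
            Element e = (Element) n;
            String tag = e.getNodeName();
            if (tag.equals("function") || tag.equals("input")) {
                args.add(eval(e, inputs)); // evaluate operands depth-first
            } else if (tag.equals("attribute")
                    && e.getAttribute("name").equals("additionalArgument")) {
                extra = Double.parseDouble(e.getAttribute("value"));
            }
        }
        String name = node.getAttribute("name");
        switch (name) {
            case "sqrt":  return Math.sqrt(args.get(0));
            case "sum":   { double s = 0.0; for (double a : args) s += a; return s; }
            case "sub":   return args.get(0) - args.get(1);
            case "add":   return args.get(0) + required(extra, name);
            case "power": return Math.pow(args.get(0), required(extra, name));
            default: throw new IllegalArgumentException("unsupported function: " + name);
        }
    }

    private static double required(Double extra, String function) {
        if (extra == null) {
            throw new IllegalArgumentException(function + " needs additionalArgument");
        }
        return extra;
    }
}
```

Adding a function in this design means adding one case to the dispatch; the traversal code does not change.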

1.4 Use cases

At the end of this chapter we present several use cases from real applications in which we have used the data processor: calculating wind speed, calculating diurnal temperature variation, converting temperature units, and searching for the maximum (minimum) parameter value over a certain time interval.

Use case 1. Daily average wind speed

We have to calculate $W = \sqrt{\bar{U}^2 + \bar{V}^2}$, where $\bar{U}$ and $\bar{V}$ are the daily averages of the U- and V-wind components, respectively.

We submit the following XML calculation tree to the DataProcessingActivity:

<dataProcess name="dataProcess">

<attribute name="seasonal" value="false"/>

<function name="sqrt">

<function name="sum">

<function name="power">

<function name="average">

<function name="createInterval">

<attribute name="length" value="86400"/>

<attribute name="boundaryTime" value=""/>

<input name="x"/>

</function>

</function>

<attribute name="additionalArgument" value="2"/>

</function>

<function name="power">

<function name="average">

<function name="createInterval">

<attribute name="length" value="86400"/>

<attribute name="boundaryTime" value=""/>

<input name="y"/>

</function>

</function>

<attribute name="additionalArgument" value="2"/>

</function>

</function>

</function>

<outputFormat name="xml"/>

<output name="dataProcessingOutput"/>

</dataProcess>

Here x and y are the names of the outputs of getXmlDataActivity, which return the raw data for the U- and V-wind components from the ESSE data source.

However, if the U- and V-wind components were measured at the same time, then the daily average wind speed can be calculated by the formula $W = \frac{1}{N} \sum_{i=1}^{N} \sqrt{U_i^2 + V_i^2}$, where $N$ is the number of observations per day. In this case we have to move createInterval and average to the top of the calculation tree.
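The two orders of averaging discussed in this use case generally give different numbers. The following sketch, with hypothetical class and method names and plain arrays instead of ESSE time series, makes the difference concrete for one day of observations.

```java
// Hypothetical comparison of the two averaging orders, not ESSE code.
final class WindAverages {

    /** Speed of the daily mean wind vector: sqrt(mean(U)^2 + mean(V)^2). */
    static double speedOfMeanVector(double[] u, double[] v) {
        return Math.hypot(mean(u), mean(v));
    }

    /** Daily mean of the instantaneous speeds: (1/N) * sum of sqrt(U_i^2 + V_i^2). */
    static double meanOfSpeeds(double[] u, double[] v) {
        if (u.length != v.length || u.length == 0) {
            throw new IllegalArgumentException("need equally long, non-empty series");
        }
        double sum = 0.0;
        for (int i = 0; i < u.length; i++) {
            sum += Math.hypot(u[i], v[i]);
        }
        return sum / u.length;
    }

    private static double mean(double[] a) {
        if (a.length == 0) {
            throw new IllegalArgumentException("empty series");
        }
        double sum = 0.0;
        for (double x : a) sum += x;
        return sum / a.length;
    }
}
```

For two observations with opposite winds, (U, V) = (3, 4) and (-3, -4), the components cancel in the first formula and give a speed of 0, while the second formula gives 5. The second form is therefore the one that measures how strongly the wind blew, provided both components share the same time stamps.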

Use case 2. Diurnal temperature variance:

<dataProcess name="dataProcess">

<function name="sub">

<function name="max">

<function name="createInterval">

<attribute name="length" value="86400"/>

<attribute name="boundaryTime" value="…"/>

<input name="x"/>

</function>

</function>

<function name="min">

<function name="createInterval">

<attribute name="length" value="86400"/>

<attribute name="boundaryTime" value="…"/>

<input name="y"/>

</function>

</function>

</function>

<outputFormat name="xml"/>

<output name="dataProcessingOutput"/>

</dataProcess>

Use case 3. Temperature units conversion from kelvins to degrees Celsius:

<dataProcess name="dataProcess">

<function name="add">

<input name="x"/>

<attribute name="additionalArgument" value="-273.15"/>

</function>

<outputFormat name="xml"/>

<output name="dataProcessingOutput"/>

</dataProcess>

Use case 4. Search for the maximum parameter value:

<dataProcess name="dataProcess">

<function name="max">

<function name="createInterval">

<attribute name="length" value="86400"/>

<attribute name="boundaryTime" value="…"/>

<input name="x"/>

</function>

</function>

<outputFormat name="xml"/>

<output name="dataProcessingOutput"/>

</dataProcess>