NASA CAN - Draft Proposal

18 Jan 2002, version 0.4

VOTable: A Proposed XML Format for Astronomical Tables

Daniel Durand, Canadian Astronomy Data Centre, Canada
Pierre Fernique, Observatoire Astronomique de Strasbourg, France
Robert Hanisch, Space Telescope Science Institute, USA
Bob Mann, Royal Observatory Edinburgh, UK
Tom McGlynn, NASA Goddard Space Flight Center, USA
François Ochsenbein, Observatoire Astronomique de Strasbourg, France
Alex Szalay, Johns Hopkins University, USA
Andreas Wicenec, European Southern Observatory, Germany
Roy Williams, California Institute of Technology, USA

1.Introduction

The VOTable format is a proposed XML standard for representing a table. In this context, a table is an unordered set of records, each of a uniform format. Each record is a sequence of (arrays of) primitive data types, together with metadata about the meaning of the data. The format is derived from the Astrores format [1] and is backward compatible with that standard, with two exceptions: (a) FIELD elements are no longer allowed outside a TABLE, and (b) the Format attribute, used for automatic parsing of sexagesimal input, is no longer supported. Astrores was modeled on the FITS Binary Table format [2].

1.1.Example

A simple example of a VOTable document is:

<?xml version="1.0"?>
<!DOCTYPE ASTRO SYSTEM " . ./VOTable.dtd">
<ASTRO ID="v1.0">
<DEFINITIONS>
<COOSYS ID="myJ2000" system="eq_FK5" equinox="2000." epoch="2000."/>
</DEFINITIONS>
<RESOURCE>
<TABLE name="Stars">
<DESCRIPTION>Some bright stars</DESCRIPTION>
<FIELD ID="Star-Name" ucd="ID_MAIN"
datatype="A" arraysize="10"/>
<FIELD ID="RA" ucd="POS_EQ_RA" ref="myJ2000"
unit="degrees" datatype="E" precision="5"/>
<FIELD ID="Dec" ucd="POS_EQ_DEC" ref="myJ2000"
unit="degrees" datatype="E" precision="5"/>

This table shows the positions of two stars, each with a name and two floating-point numbers as coordinates. The star names have a fixed length of 10 characters (shorter names are padded with trailing blanks). The floating-point numbers (RA and Dec) are in degrees, and are assumed to have five significant digits (precision="5"), irrespective of the number of digits presented in the data. The frame of the coordinate system is specified explicitly with the COOSYS element.
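As an illustration, the FIELD metadata described above can be read with any general-purpose XML parser. The following is a minimal sketch in Python using the standard xml.etree.ElementTree module. The fragment is a reduced copy of the FIELD definitions of the example, written with empty FIELD elements and without data rows; the variable names are our own assumptions and not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the TABLE metadata from the example (assumption: FIELD
# elements are written as empty elements; no data rows are included).
FRAGMENT = """
<TABLE name="Stars">
  <DESCRIPTION>Some bright stars</DESCRIPTION>
  <FIELD ID="Star-Name" ucd="ID_MAIN" datatype="A" arraysize="10"/>
  <FIELD ID="RA" ucd="POS_EQ_RA" ref="myJ2000" unit="degrees" datatype="E" precision="5"/>
  <FIELD ID="Dec" ucd="POS_EQ_DEC" ref="myJ2000" unit="degrees" datatype="E" precision="5"/>
</TABLE>
"""

table = ET.fromstring(FRAGMENT.strip())

# Collect the attributes of every FIELD, in document order.
fields = [dict(f.attrib) for f in table.findall("FIELD")]

for field in fields:
    # The unit attribute is optional, so fall back to a dash when absent.
    print(field["ID"], field["datatype"], field.get("unit", "-"))
```

Each FIELD becomes a plain dictionary of its attributes, which is enough to act as the template for the records that would follow in the data section.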

1.2.XML

VOTable is constructed with XML (Extensible Markup Language), a powerful standard for structured data throughout the Internet industries. It derives through simplification from SGML, which has been a standard in technical documentation for many years. XML consists of elements and payload, where an element consists of a start tag (the part in angle brackets), the payload, and an end tag (with angle brackets and a slash). Elements can contain other elements. Elements can also carry attributes (keyword-value combinations), as the FIELD elements above do.

The payload may be in two forms: parsed or unparsed character data. Examples are:

<text>Fran&#231;ois</text>
<text><![CDATA[ a <= (b & c) ]]></text>

In the first example, the sequence &#231; is interpreted as part of the ISO/IEC 10646 character set, and translates to an accented character, so that the text is "François". The second example uses the special CDATA sequence so that the characters <, >, and & can be used without interpretation; in this case, any ASCII characters are allowed except the terminating sequence "]]>". For more information, see any book on XML.
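The two forms of payload can be checked with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree; it assumes that the first example is written with the numeric character reference for the accented character, and it is meant only as an illustration of parsed versus unparsed character data.

```python
import xml.etree.ElementTree as ET

# Parsed character data: the numeric reference &#231; is expanded by the parser.
parsed = ET.fromstring("<text>Fran&#231;ois</text>")

# Unparsed character data: the CDATA content is passed through unchanged,
# so <, > and & need no escaping.
unparsed = ET.fromstring("<text><![CDATA[ a <= (b & c) ]]></text>")

print(parsed.text)
print(repr(unparsed.text))
```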

1.3.Syntax policy

Element names are written in uppercase to aid readability. Attribute names are preferably in lowercase (with an exception for the ID attribute). Element and attribute names are further distinguished in this paper by being set in a fixed-width font.

1.4. Remarks about the ID attribute

VOTable uses the ID attribute defined by the XPointer standard in order to refer to other elements in the document. The ID attribute is a string beginning with a letter or underscore (_), followed by a sequence of letters, digits, or any of .-_:, and each ID must be unique in the XML document. For example, ref="apple" refers to the element that contains ID="apple" in the current XML document. Elements that may have ID attributes are ASTRO, COOSYS, FIELD, INFO, LINK, RESOURCE, TABLE, and VALUES. Elements that support the ref attribute (and can point to those with ID) are CELL, FIELD, and TABLE.

The ID is different from the name attribute in that (a) the ID attribute must be unique (or else the document is considered invalid in the XML sense), whereas names need not be unique; and (b) there should be support in the parsing software to look up references and extract the relevant element with matching ID. It should be noted that this referencing mechanism will not work unless the software uses a validating parser.
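To make the ID and ref relationship concrete, the following Python sketch indexes the ID attributes of a small document by hand and then follows a ref. This is not the validating-parser mechanism referred to above; it is a plain dictionary lookup, substituted here for illustration. The function names and the sample document are our own assumptions, and the ID pattern covers ASCII letters only.

```python
import re
import xml.etree.ElementTree as ET

# ID syntax as described in the text: a letter or underscore, followed by
# letters, digits, or any of . - _ :  (ASCII letters only; a simplification).
ID_PATTERN = re.compile(r"^[A-Za-z_][A-Za-z0-9._:-]*$")

def build_id_index(root):
    """Map each ID value to its element, rejecting bad or duplicate IDs."""
    index = {}
    for element in root.iter():
        id_value = element.get("ID")
        if id_value is None:
            continue
        if not ID_PATTERN.match(id_value):
            raise ValueError("malformed ID: %r" % id_value)
        if id_value in index:
            raise ValueError("duplicate ID: %r" % id_value)
        index[id_value] = element
    return index

def resolve_ref(element, index):
    """Return the element that the ref attribute points to, or None."""
    ref = element.get("ref")
    return index.get(ref) if ref is not None else None

# Sample document, reduced from the example of section 1.1.
DOC = """<ASTRO ID="v1.0">
  <DEFINITIONS>
    <COOSYS ID="myJ2000" system="eq_FK5" equinox="2000." epoch="2000."/>
  </DEFINITIONS>
  <RESOURCE><TABLE name="Stars">
    <FIELD ID="RA" ucd="POS_EQ_RA" ref="myJ2000" datatype="E"/>
  </TABLE></RESOURCE>
</ASTRO>"""

root = ET.fromstring(DOC)
index = build_id_index(root)
target = resolve_ref(index["RA"], index)
print(target.tag, target.get("system"))
```

Here the RA field points to the COOSYS element through ref="myJ2000", and a duplicate ID would be reported when the index is built.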

2.Semantics of a VOTable

In this section we define the semantics of a VOTable, and in the next sections its syntax. A table has two sections, metadata and data – see figure. The metadata describes the table itself (name, title, description, and an optional coordinate system), and the nature of each field (column) of the table is defined by the FIELD element. There may also be STREAM objects that are intended to connect either the table or its records to external data sources through local files, ftp, http, gridftp, or other protocols. The address of the remote object is written in the URL syntax, protocol://resource:port/file.

A Table in this context is illustrated below. The top line of the table is a class definition (metadata) for all the instances (also known as rows, or records) of data in the subsequent lines. The VOTable document may contain the data part of the table, or it may not. If it does not contain data, there may be a pointer to the data; this is preferable when the data is large, as XML tools may become unreliable for very large data sets. Each row of the table is a set of instances of primitive types, such as float, int, doubleComplex, and so on (see the table below for the complete list). There may also be strings and blobs for holding binary content. These may have the same length in each row, or each instance may have a different length. The semantic meaning of a blob (e.g. "This is a JPEG image") is not defined by VOTable, but it may be written into the description or name attributes, or conveyed through the ID mechanism discussed above.

Each FIELD (or column) of the table is defined by the nature of the primitive data, and by name, description, units, and info attributes. There is also a Unified Content Descriptor (UCD), which is a reference into a glossary created at CDS Strasbourg. Another attribute is the precision, which expresses the implied accuracy (number of significant digits) of each datum in this column.

The list of FIELD elements (or column definitions) can be thought of as a template for the records (or rows) of the table, which follow in the DATA section. The records are fundamentally unordered, meaning that a table with the records in a different order is equivalent to the original. Ordering of records is a presentation property of the data rather than a structural one.
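The equivalence of tables under reordering can be expressed as a comparison that ignores record order. The Python sketch below compares two lists of records as multisets, so that repeated records are still counted; the function name and the sample records are invented for illustration.

```python
from collections import Counter

def same_table(rows_a, rows_b):
    """True if both tables hold the same records, regardless of order."""
    return Counter(map(tuple, rows_a)) == Counter(map(tuple, rows_b))

# Hypothetical records (name, RA, Dec), invented for illustration only.
a = [("alpha", 10.0, -5.0), ("beta", 20.0, 15.0)]
b = [("beta", 20.0, 15.0), ("alpha", 10.0, -5.0)]

print(same_table(a, b))
```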

We should note that a VOTable document may be used to express a question as well as an answer. Suppose there is a table that has no data: it has all the metadata (header) fields, as above, but no actual data rows. Then we could think of this document as a form that is to be filled in, as a request for data; the specification of a class serves as an implicit request for instances.

2.1.FITS Binary Tables

VOTable is completely compatible with the FITS Binary Table format. The semantics of any FITS binary table file may be completely represented with VOTable. The metadata for the FITS file may be converted to VOTable, and the FITS file pointed to by the VOTable.

3.Metadata Content

The Table is written in XML with TITLE, DESCRIPTION, and LINK elements that describe the nature of the data in the table. The LINK element may be parsed (see section 3.5). There may be a COOSYS element that contains specific information on the astronomical coordinate system being used. The rest of the metadata describes the FIELDs that together make up each row of the table.

A FIELD element may have several sub-elements, including the informational TITLE, DESCRIPTION, and LINK, as well as VALUES, that can express limits and ranges of the values that the corresponding cell can contain, such as minimum, maximum, or enumeration of possible values.

The FIELD must contain a datatype attribute, which expresses the nature of the data that is in the cells of this column of the table. This determines how data is read and stored internally. If it is not present, an exception is thrown.

Each table cell may contain more than one value of the specified datatype, and this is specified with the arraysize attribute. The default value of this attribute is generally 1, meaning a single value in the table cell. In the case of the Bit datatype, the length represents the number of 8-bit bytes that are used. Character strings will be padded with null characters if they are shorter than the specified length.

Unicode is a way to represent characters that is an alternative to ASCII. It uses two bytes per character instead of one, it is strongly supported by XML tools, and it can handle a large variety of international alphabets. Therefore VOTable supports not only ASCII strings (datatype="A"), but also Unicode (datatype="U"). For backward compatibility with Astrores, the default size of these may be given by the "width" attribute (see section 3.1) if it is present: for datatype="A", arraysize defaults to width (or 1 if width is not present), and for datatype="U", arraysize defaults to 2*width (or 2 if width is not present).
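The defaulting rule for string fields can be written down in a few lines. The Python sketch below follows the rule as stated in the text; the function name is our own, and the treatment of other datatypes simply uses the general default of 1.

```python
def default_arraysize(datatype, width=None):
    """Default arraysize for a FIELD, following the rule in the text."""
    if datatype == "A":
        # ASCII strings: one unit per character.
        return width if width is not None else 1
    if datatype == "U":
        # Unicode strings: two units per character.
        return 2 * width if width is not None else 2
    # Other datatypes: the text says the default is generally 1.
    return 1

print(default_arraysize("A", 10), default_arraysize("U", 10))
```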

Variable-size arrays are also supported through the attribute called arraytype. By default, this has the value "fixed", and the array size is given by the arraysize attribute. If arraytype="variable", however, the corresponding table cells can contain a variable-width array. For example, a JPEG image could be associated with each row of the table by using datatype="B" and arraytype="variable". However, it should be pointed out that the processing of uniform-length strings and blobs will be much more efficient than that of variable-length ones, although the storage efficiency can be much greater with the variable-length mode.

For details of the exact meaning of these data types, please see section 7.

If the data is written in the TABLEDATA or CSV forms, there may be an attribute to define the handling of arrays and complex numbers. If a CELL contains an array or complex number, it should be encoded as multiple numbers with a separator character between them. This character may be defined by the arraysep attribute; its default value is a blank. In the case of character and Unicode strings, however, no separators are required.
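A reader of TABLEDATA or CSV content might split the text of a cell as follows. This Python sketch assumes that every non-string datatype can be read as a floating-point number, which is a simplification (the full list of datatypes is given in section 7); the function name is hypothetical.

```python
def split_cell(text, datatype, arraysep=" "):
    """Split the text of one table cell into its array members."""
    if datatype in ("A", "U"):
        # Character and Unicode strings need no separators.
        return text
    if arraysep == " ":
        # Default separator: any run of blanks separates two members.
        parts = text.split()
    else:
        parts = [p.strip() for p in text.split(arraysep)]
    # Assumption: numeric members are read as floats.
    return [float(p) for p in parts]

print(split_cell("1.5 2.5 3", "E"))
print(split_cell("1.5;2.5", "E", arraysep=";"))
```

A complex number would be read in the same way, as a pair of numbers (real and imaginary parts) with the separator between them.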

3.1.Numerical Accuracy

The VOTable format is meant for transferring, storing, and processing tabular data; it is not intended for presentation purposes. Therefore (in contrast with Astrores) we generally avoid giving rules on presentation, such as formatting. However, we retain the "width" attribute of the FIELD, which is meant as a hint to the presentation system about the number of characters to use for input or output of the quantity.

But there is a semantic difference between a number written as "5.12" and one written as "5.1200", in that the former implies three significant digits of accuracy, and the latter five. Therefore the number of digits to show is not purely a presentation matter, but part of the metadata content of the number.

VOTable therefore provides the precision attribute in the FIELD element to express the number of significant digits, or equivalently, the log of the implied error estimate of the numbers in the column. More control is available through an initial character: setting this to "E" rather than the default "F" implies that the precision measures relative error (significant figures) rather than absolute error (decimal places). Thus precision="E5" means an implied relative error of 10^-5, and precision="5" or "F5" means an implied absolute error of 10^-5.
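The interpretation of the precision attribute can be sketched in Python as follows. The function name and the returned tuple are our own choices; the sketch only covers the forms named in the text (a bare number, or a number prefixed by "F" or "E").

```python
def parse_precision(precision):
    """Interpret a precision attribute such as "5", "F5" or "E5".

    Returns (kind, digits, implied_error), where kind is "absolute"
    (decimal places, the default "F") or "relative" (significant figures).
    """
    kind = "absolute"
    digits = precision
    if precision[:1] in ("E", "F"):
        if precision[0] == "E":
            kind = "relative"
        digits = precision[1:]
    n = int(digits)
    # Build the implied error 10^-n from its decimal literal.
    return kind, n, float("1e-%d" % n)

print(parse_precision("E5"))
print(parse_precision("5"))
```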

3.2.Units

The quantities in a column of the table may have physical units, and this is specified by the units attribute of the FIELD. Examples are:

units="cm-2.s-1.keV-1"
units="erg.s-1"

The syntax of this string is defined in reference [3].
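As a rough illustration, the two example strings can be decomposed into unit symbols and exponents. The Python sketch below handles only the simple pattern seen in these examples (factors separated by dots, each with an optional signed integer exponent); it is not an implementation of the full syntax of reference [3], and the function name is hypothetical.

```python
import re

# One factor: a unit symbol followed by an optional signed integer exponent.
FACTOR = re.compile(r"^([A-Za-z]+)([+-]?\d+)?$")

def parse_units(units):
    """Split a units string into (symbol, exponent) pairs."""
    factors = []
    for part in units.split("."):
        match = FACTOR.match(part)
        if match is None:
            raise ValueError("unsupported unit factor: %r" % part)
        symbol, exponent = match.groups()
        # A missing exponent means the unit appears to the first power.
        factors.append((symbol, int(exponent) if exponent else 1))
    return factors

print(parse_units("cm-2.s-1.keV-1"))
print(parse_units("erg.s-1"))
```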

3.3.Unified Content Descriptors

The CDS in Strasbourg has used the metadata from thousands of astronomical tables to make a hierarchical glossary of the scientific meanings of the data in those tables [4]. Of 1600 entries in the glossary, here are a few typical examples.

PHOT_INT-MAG_B        Integrated total blue magnitude
ORBIT_ECCENTRICITY    Orbital eccentricity
STAT_MEDIAN           Statistics Median Value
INST_QE               Detector's Quantum Efficiency

The ucd attribute of the FIELD is to hold this information.

3.4.VALUES element

The VALUES element of the FIELD is designed to hold subsidiary information about the nature of the data in the field. It may have MIN and MAX elements, and it may contain OPTION elements. Each OPTION has name and value attributes, and may itself contain further OPTION elements, so that a hierarchy of keyword-value pairs may be associated with each field.

There may also be a null attribute. If this is present, and a table cell takes this value, it is assumed to mean that no data is present. For example, there may be a convention that missing values in a table are expressed with -99, in which case the "missing" table cell would be set to this. Therefore any cell in this field with this value is assumed to have no data.
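A reader might apply the null convention as in the following Python sketch, which replaces every cell equal to the declared null value by None. The function name and the sample column are assumptions made for illustration.

```python
def apply_null(cells, null):
    """Replace every cell equal to the null value by None (no data)."""
    return [None if cell == null else cell for cell in cells]

# Hypothetical column in which -99 marks a missing value.
print(apply_null([12, -99, 7, -99], null=-99))
```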

There may also be an attribute called "invalid", giving the value that should be used in case a table cell cannot be read. Suppose, for example, that a row of a table should be all integers, and its CSV representation is:

34, 3w4, 45, 11, ---, 76

In this case, the unparsable values "3w4" and "---" will cause an exception to be thrown, unless the VALUES element of the relevant field definition sets the invalid attribute (to -1, say), in which case the cells with the bad text would both contain the integer -1 instead. This will allow a VOTable parser to act as a debugging tool for very large tables that may have a few bad data elements.
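The behaviour described here can be sketched in Python as follows, using the CSV row from the text. The function name is hypothetical; when no invalid value is declared, the parsing error is simply passed on to the caller.

```python
def parse_int_row(line, invalid=None):
    """Parse one CSV row of integers, substituting `invalid` for bad cells."""
    values = []
    for cell in line.split(","):
        try:
            values.append(int(cell.strip()))
        except ValueError:
            if invalid is None:
                raise  # no fallback declared: report the bad cell
            values.append(invalid)
    return values

print(parse_int_row("34, 3w4, 45, 11, ---, 76", invalid=-1))
```

With invalid set to -1, the two unreadable cells are replaced and the rest of the row is preserved, so a large table with a few bad cells can still be loaded and inspected.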

3.5.LINK Elements as URL Templates

The LINK element provides pointers to other documents or data servers on the Internet through a URL. In Astrores, the LINK element may be part of the RESOURCE, TABLE or FIELD elements. The href attribute of the LINK is meant to provide a URL that is at least valid syntactically, even though there need be no assurance that the link will actually connect and deliver data. The URL may imply a protocol that the parser does not know about, for example gridftp://server/file. However, parsers are expected to understand at least the file, http and ftp protocols.

The gref attribute is meant for a higher-level protocol of some type, perhaps a logical name for a data resource, perhaps a GLU reference [5].

In some cases, there are additional semantics for the LINK element, where the href and gref attributes are not a simple URL, but rather a template for creating URLs. Depending on the content-role attribute of the LINK, and the nature of the parent element, the ID tags from the table may be substituted into the template to create an implicit new column, as explained in the next section.

3.5.1.Pattern-matching and Substitution

When a LINK element appears within a TABLE, there is extra functionality implied. The href or gref attributes may not be a simple link, but instead a template for a link. For example, in the table of section 1.1, we might have:

<LINK href=”

The implication is that the text is seen in the context of a particular row of the table, and a substitution filter is applied. If the selected row of the table is the first one, the result of the substitution would be:

Whenever the pattern ${…} is found in the original link, the part in the braces is compared with the set of name attributes of the fields of the table. If a match is found, then the value from that field of the selected row is used in place of the ${…}. If no match is found, no substitution is made. Thus the parser makes available to the calling application a value of the href and gref attributes that depends on which row of the table has been selected. Another way to think of it is that there is not a single link associated with the table, but rather an implicitly defined new column of the table. This mechanism can be used to connect each row of the table to further information resources.
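The substitution rule can be sketched in a few lines of Python. The template URL, the field names and the row values below are invented for illustration and do not come from the example table; the sketch performs no URL encoding of the substituted values.

```python
import re

# Matches ${...} and captures the part in the braces.
PLACEHOLDER = re.compile(r"\$\{([^}]*)\}")

def expand_link(template, row):
    """Substitute ${name} patterns using the values of the selected row."""
    def replace(match):
        name = match.group(1)
        if name in row:
            return str(row[name])
        return match.group(0)  # no matching field: leave the pattern as is
    return PLACEHOLDER.sub(replace, template)

# Hypothetical template and row, invented for illustration only.
template = "http://example.org/lookup?ra=${RA}&dec=${Dec}&x=${Unknown}"
row = {"RA": 10.5, "Dec": -20.25}

print(expand_link(template, row))
```

Applying the function to every row in turn yields the implicitly defined new column of links described above.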

The action attribute in this release of the standard is simply a string. In a future release, it may gain an implied string substitution filter as with href and gref.

The purpose of the link is defined by the content-role attribute. The allowed values are query, hints, and doc. The first implies that string substitution should be used as defined above; the latter two imply that no substitution is needed, and that the link points either to information for use by the application (hints) or to human-readable documentation (doc).

3.6.Type Attribute

The type attribute of the FIELD may carry values that express the status of the field when the enclosing table is a query, rather than a data document. If the value is “noquery”, then the marked field is ignored in the creation of the action query – this field does not belong to the form described by the set of FIELDs. A computed column (value computed from other FIELDs) is a typical example.