Querying XML

Documents on the

Web with WQL

Applying the

XML-QL Engine

I. Extensible Markup Language (XML)

1.1 Overview

1.2 XML Data Model and Syntax

1.3 Document Type Definition, DTD

1.4 Extensible Stylesheet Language (XSL)

II. The Web Query Language (WQL)

2.1 Data Model

2.2 WQL Query Structures

2.3 Structuring the Output in WQL

III. XML Query Language (XML-QL)

3.1 Data Model and Syntax

3.2 Types of Queries

IV. Implementation of WQL Queries with Relational Output

4.1 Structure and Syntax applied for implementation

4.2 Query examples

4.3 Transforming a WQL Query with Relational Output to XML-QL

VI. Structured WQL Output Implementation

5.1 Transforming WQL Query to XML-QL

5.2 Merging Structured Output

5.3 Implementation of Merge in Structuring tree

VII. Displaying The Result of a WQL query

6.1 Reformatting Output for display

6.2 Generating Stylesheet (XSL) File

6.3 Implementation

Introduction

This thesis discusses querying web documents written in XML (Extensible Markup Language), the relatively new web standard for data and database documents. We use the application of WQL (Web Query Language) for queries. Implementation is built over the XML-QL (XML query language) query engine. Formatting techniques are applied to display resulting XML document with the web browser MSIE5+ (Microsoft Internet Explorer 5 or greater). For now, the current version of MSIE is 5.5. Two programs are written in C programming language. The first converts WQL queries to XML-QL for running them in UNIX environment. The second transforms XML-QL output and generates stylesheet to display the resulting XML document.

We begin with an overview of XML, and show why XML, its extensions, and like languages tend to replace HTML in web documents . We give a brief description of XML Data model and some details of XML components: DTD, Document Type Definition, and XSL, Extensible Stylesheet Language. Next, we describe WQL Data Model, as the combination of a tree representation and predicates, and show WQL query and output structures. XML-QL tools are chosen for executing WQL queries. Therefore, a brief explanation of XML-QL Data Model, syntax and query structures are given then.

The last three chapters include implementation of WQL queries, formatting, and displaying results. We start with observation of implementation of WQL queries with flat, relational, output. We give a transformation algorithm to run WQL queries over the XML-QL engine, where WQL queries, at first, have to be converted to XML-QL queries. Next, we discuss implementation of WQL queries with a structured output. In addition to transformation, we apply merging techniques for resulting XML document. A merge algorithm is represented then. Finally, we give the algorithm and examples of implementation of displaying XML output document (WQL query result), that is generation an XSL stylesheet.

We conclude with a discussion of possible future work on WQL applications, which use XML-QL or other XML query tools.

XML

Overview.

Extensible Markup Language, XML, is a relatively new standard of The World Wide Web Consortium (W3C) for data handling and exchange on the Web. XML was designed specifically to describe content, rather than presentation. The basic component in XML is the element, that is a character or a string of characters bounded by matching tags such that <author>(start tag) and </author> (closing tag). Tags in XML are defined by users. A structure in between tags is called an element content. That can be data, sub elements, or both. In addition, XML element can have attributes inside start tag. There are differences between attributes and tags. Only one occurrence of a given attribute is allowed within a tag, while sub elements with the same tag can be repeated. For example, <author id=“01" href=“http://hake.stanford.edu/~ullman">Ullman<book>DBS1</book><book>DBS-2</book></author> is the author element with id and href (link) attributes which values are “01"and “http://..." respectively, with the repeated sub element book, and with the data content “Ullman".

XML allows querying data on the Web in a more efficient way than with HTML. HTML doesn't have many facilities to define semantics. HTML tags are general and independent on data content. For instance, many HTML documents represent tables with visible or invisible borders, and table tags <table>, <tr> (table row), <td> (table cell) are widely used for wrapping data elements. Such tags do not help users or software to define data semantics. In turn, XML tags, defined by a user, have meaning and help querying XML documents. Let us start with examples of HTML and XML files modified for simplification from real files on the Web. Throughout the thesis we will use examples from ACM SIGMOD page http://www.acm.org/sigs/sigmod/record/xml/which has both HTML http://www.acm.org/sigmod/record/previous.html and XML versions http://www.acm.org/sigs/sigmod/record/xml/Record/HomePage/index.xml .

Data part of the index page, without pure formatting instructions in HTML represents year, month, volume, number, name of the conference, and link:

<html>

<table >

...

<tr>

<td><1999</td>

</tr>

<tr>

<td><a href="/sigmod/…/index.html">

March</a></td>

<td>Vol 28, No 1</td>

<td>Semantic Interoperability… </td>

</tr>

<tr>

<td>1998</td>

</tr>

<tr>

<td><a href="/sigmod/9812/index.html">

December</a></td>

<td>Vol 27, No 4</td>

<td></td>

</tr>

<tr>

<td><a href="/sigmod/record/issues/9809/index.html">

September</a></td>

<td>Vol 27, No 3</td>

<td></td>

</tr>

...

</table>

</html>

The same data part in XML (almost complete XML file) is the following:

<HomePage>

<yearListTuple>

<year>1999</year>

<numberListTuple>

<toIssues href="../OrdinaryIssuePage/281.xml">

<month>March</month>

</toIssues>

<volume>28</volume>

<number>1</number>

<conferenceName>Semantic Inter...</conferenceName>

</numberListTuple>

</yearListTuple>

<yearListTuple>

<numberListTuple>

<toIssues href="../OrdinaryIssuePage/274.xml">

<month>December</month>

</toIssues>

<volume>27</volume>

<number>4</number>

<conferenceName></conferenceName>

</numberListTuple>

<numberListTuple>

<toIssues href="../OrdinaryIssuePage/273.xml">

<month>September</month>

</toIssues>

<volume>27</volume>

<number>3</number>

<conferenceName></conferenceName>

</numberListTuple>

</yearListTuple>

... …

</HomePage>

It is obvious that tags in the second example describe contents of data and will be helpful for querying software. Even different (not table) HTML tags do not say anything about their content except formatting instructions. On the other hand, HTML document itself is ready to display data (in our example, most of the formatting has been cut off for better data part comparison), and XML document needs additional instructions to be displayed. Style sheet files play a presentation role for XML documents and exploit HTML features. Two formats are currently used for these purposes: XSL (Extensible Stylesheet Language), and CSS (Cascading Style Sheet). XSL is taken as basis for our representation and discussed in some details further in this chapter. Comments in XML documents are bounded by `<!--' and `-->' symbols and can inserted in any place of a document in form of:

<!-- comments can be here -->.

XML Data Model and Syntax

XML document consist of Root and Elements which can include elements, data, mix of both, or can be empty. In addition, each element can have one or more Attributes, defined by name and value. Well-formed XML document must start with line <?xml version="1.0"?>, declaring current version of XML, and conform the following:

1. should have at least one element

2. must contain a unique root element

3. all elements inside document must be properly nested.

Where an element has the form <element name> element content </element name>, and <...> is an opening tag (start-tag), </...> is a closing tag (end-tag), both are required. Now let us look at XML data model applying the previous example, that can be represented as a tree structure on Figure 2.1. It is a node labeled graph with edges connecting nodes. It has one root element Home Page. The tree terminates at leaves, elements which have or are supposed to have only a character string as a child.

The structure of an XML document can be discussed using DTD (Document Type Definition)

Document Type Definition, DTD.

A Document Type Definition, DTD, serves as grammar for XML document and can be used schema. DTD describes document structure and hierarchy. Given DTD for our previous example on Figure 2.1 is written as follows:

<!DOCTYPE HomePage[

<!ELEMENT HomePage (yearListTuple)+>

<!ELEMENT yearListTuple (year, numberListTuple+)>

<!ELEMENT year (#PCDATA)>

<!ELEMENT numberListTuple (toIssues, volume,

number, conferenceName)>

<!ELEMENT toIssues (month)>

<!ATTLIST toIssues

xml:link CDATA #FIXED 'simple'

href CDATA #REQUIRED>

<!ELEMENT month (#PCDATA)>

<!ELEMENT volume (#PCDATA)>

<!ELEMENT number (#PCDATA)>

<!ELEMENT conferenceName (#PCDATA)?>

]>

The first line (and the upper ELEMENT declaration) says that <HomePage> is the root element. The first and the last lines are necessary only if DTD is declared inside XML document. If DTD is located in a special file with “dtd" extension we don't need those lines but DTD has to be linked from XML file as <!DOCTYPE HomePage SYSTEM `location of DTD file’>. Each element is declared as <!ELEMENT tag (contents)>. If element is not empty inside parenthesis can be elements, data(PCDATA) means parsed character data) or mix of them. Otherwise, an element name must be followed by EMPTY keyword. For declaration of an element with any content the keyword “ANY" must follow the element name. Special characters applied as follows: `,' for ordering; `\mid' for selection (one of…); `?' one or none; `+' for any number of elements but at least one; `*' for any number including none; and parenthesis for grouping.

ATTLIST is followed by an element name and its attribute or attributes. In attributes declaration we use CDATA as abreviation of character data. In our case, the attribute is a link, and href is the reserved keyword for the linking specification. I turn, xml:link is the reserved XML keyword as an attribute to define links, it can take the value “simple" or “extended". When the xml:link attribute's value is #FIXED, xml:link does not need to be included in each instance of the element, it is implied in each toIssues tag, and we can use:

<toIssues href=“../OrdinaryIssuePage/273.xml">.

Finally, a well-formed XML document including DTD, where every element type in the document is declared in the DTD is called a valid document. We use the valid XML document notation further in the text.

Extensible Stylesheet Language (XSL)

Extensible Stylesheet Language, XSL, consists of two parts: the first, XSL Transformations (XSLT), a language for transforming XML document to another XML (or other formats) document, and the second, an XML vocabulary for specifying formatting semantics (XSL Formatting Objects), formatting language. When an XSL processor accepts a document in XML format and an XSL stylesheet, it produces the presentation of the document in two steps: first (tree transformation), constructing a result tree from XML source, and second (formatting), interpreting the result tree to produce formatted results for presentation, (i.e., on a computer display, printer or other media).

Formatting is made by insertion of formatting semantics in the result tree in form of formatting objects(f.o.): <fo:f.o.declaration f.o.property= ``value">…</fo:..> For instance,

<fo:block color=``red"font-size=``16pt"> …</fo:block> provides red font 16pt for string pattern inside element (formatting object) fo:block. The transformation and formatting parts in XSL can function independently of each other. For our application we use this approach where an XSLT can transform an XML document into well-formed HTML file completely ignoring the XSL formatting objects. The latter style of the XSL

(without applying formatting objects) is supported by MSIE5+.

A Data Model for XSL is similar to XML (and XSL stylesheet is written in XML). It is a tree with a unique root node representing the whole document, but it is different from the root of the corresponding XML document because it appears to be the parent of the XML document root. The reason to make an additional parent node is that XSL processor treats comments (which can, actually, be located before the root in XML source document) as elements.

An XSL file (XSL program) is a set of template rules. Each rule consists of a pattern and a template. XSL starts from the root element and tries to apply a pattern to each node. If it succeeds, it executes the corresponding template and repeats the process for the node's children. For example, the following is an XSL program (transformation rules):

<xsl:template>

<xsl:apply-templates/>

</xsl:template>

<xsl:template match="HomePage/*/conferenceName">

<Cname>

<xsl:value-of/>

</Cname>

</xsl:template>

The first three lines mean that the whole program is applied to all elements (there is no match attribute for xsl:template element). The rest of program finds all matches of conferenceName and tag (element) names in arbitrary depth in the document and returns found element values framed by Cname open and closing tags. If this XSL program is applied to the XML document of Section 2.1, the output will be:

<Cname>Semantic Interoperability...</Cname>

In the second half of the XSL program above the xsl:template element with the attribute match can be replaced by the <xsl:for-each> element with the attribute select to produce the same result. We use the latter syntax feature in XSL generation discussed in Chapter VII.

WQL

Data Model

The Web Query Language, WQL, proposed by Lakshmanan, Sadri, and Subramanian, is a logic-based language for querying web documents as structured or semistructured data sources and representing the query results in different formats. WQL syntax is similar to Prolog and language includes the alphabet, variables, function and predicate symbols. Data model includes a Schema Tree, a Schema Graph, and an Instance Tree. The Schema Tree is a tree T=(N,E), where N is the set of nodes which are actually elements from DTD (if we apply WQL to valid XML documents) or elements from well-formed XML document, and E is set of edges which represent direct relations between Parent element (node) and Child element (node). Therefore, the Schema Tree can serve as a tree representation for DTD of a particular XML document or the document itself. In turn, node labels (concepts) correspond to element (tag) names. The previous example with DTD from Section 2.3 can be expressed in Figure 3.1.

Note that nodes with set values are indicated by `+'.

Schema tree

Schema graph

While the schema tree is a model for documents using the same DTD, a Schema Graph represents a collection of documents with different DTDs connected by hyperlinks. The Schema Graph is a graph G(T,L), where T is a set of schema trees, and L is a set of special labeled edges, links between trees as representatives of connected documents). Addition new tree with the root OrdinaryIssuePage to the tree from Figure 3.1 via hyperlink produces a schema graph shown on Figure 3.2.

The Instance Tree is a tree TD (ND, ED), where ND is the set of nodes and ED is set of edges, for a given document D. Intuitively, if TD represents a hierarchical structure of valid XML document D, then the schema tree represents the DTD for this document. Figure 2.1 represents the instance tree of an XML document corresponding to the schema tree of Figure 3.1.

A query, a WQL rule, consists of a body and a head, where the body can be represented as one or more predicates or/and one or more template trees. The head can be a predicate for a flat output or a structuring tree. The Template Tree, T=(N,E), is a set of nodes and edges. It represents a pattern which can be built of DTD for XML documents. The Template Tree differs from Schema Tree in two aspects. The first, it uses selected nodes needed for a query, therefore it is partial tree of the Schema tree. The second, it has additional features for each node. The node is specified by its id, URL, concept, and data, which can be variable, empty, or (except id) constant. The edge connects two nodes inside a document or represents a link (links) to another document. The edge label `*' shows a path with length >= 1, where the path is an edge or a sequence of edges between two nodes. A set of template trees T, with the set of links L, connecting trees is a Template Graph (T,L). The links , local or remote, are special labeled (by path expression) edges. And the last formal notation, WQL atom, which defines the query body content, consists of predicates and template graphs. The next section shows details of the WQL query and query body structure.