Data storage and delivery on the web and the role of XML

Sigrun Ragnarsdottir, March 1999

Table of contents:

Introduction

XML overview

Benefits from XML

XML Vocabularies

XML structure

Logical Structure

Physical Structure

XML syntax

Database Solutions using XML

Flat file Solution

Relational Solutions

Object Oriented Solution

Hybrid Solution

Virtual database

Conclusions

References

Table of figures:

Figure 1: The markup evolution

Figure 2: The 'X' series

Figure 3: XML structure

Figure 4: XML namespaces

Figure 5: Example of an XML document syntax

Figure 6: Flat file solution

Figure 7: Relational Middle-ware

Figure 8: OO database

Figure 9: Textual database

Figure 10: Virtual database

Introduction

The purpose of this survey is to address the role of XML (eXtensible Markup Language) in web technology. XML has been put forward as the future markup language of the web. It has a simpler syntax than SGML and is more expressive than HTML.

This survey starts with an overview of XML and then describes how it can be used in combination with different database technologies to manage and deliver data. It addresses topics such as:

  • Using XML to access data.
  • Delivering data using XML.
  • What kind of storage management to use (flat file, object-oriented, relational, hybrid) and what to store.

Many companies use relational databases and want to continue using them. Other companies want to explore new solutions. Each database technology affects the performance, storage and flexibility of the system. This survey does not consider optimizations for the different technologies, nor does it consider different algorithms.

XML overview

Figure 1: The markup evolution

HTML was created to display data. What was still missing was a relatively simple, flexible and yet powerful way to exchange, describe and deliver data on the web. The fact that XML is an open standard developed by the W3C ensures that it is uniform and not owned or dominated by any single vendor. The power of XML is its separation of view and data, which makes integration of data from diverse sources easier.

HTML is used to view data but doesn’t have the flexibility of XML. The attraction of XML is the X (eXtensible). The learning curve for SGML is too steep for easy creation of structured or semi-structured documents. A lot of work has already been done in SGML, which hopefully is transferable to XML[i].

Figure 2: The 'X' series

To explore XML beyond using it as a data format we have to mention three related, but different standards:

  • XML (eXtensible Markup Language). Describes the markup standard. Structured data in XML can be made self-describing with either a DTD (Document Type Definition) or an XML Schema. These provide a description of the structure of the data, if it doesn’t already have a built-in description.
  • XSL (eXtensible Stylesheet Language). Describes how the arbitrary tags of an XML document should be presented according to a set of rules, which can hide some elements and display others. XSL provides a superset of CSS (Cascading Style Sheets).
  • XLL (eXtensible Linking Language). Describes how to link from a linking element to a resource. XLL takes HTML linking a bit further and is divided into XLink for linking and XPointer for addressing individual parts, ranges or spans of a document.

In addition, several standards for the exchange of meta-data have been proposed, for example by Microsoft (XML-Data), Netscape (MCF) and the W3C (RDF).

As references for XML I used Microsoft’s web site[ii], declarations from OASIS[iii] and books by Harold[iv] and Bradley[v].

Benefits from XML

XML will contribute most in the following situations[vi]:

  • Virtual databases, where the client has to mediate between two or more heterogeneous databases. XML can use vocabularies to add meta-data or meta-content to HTML, making data exchange easier.
  • Applications that try to move the workload from the server to the client. After delivery from the server, data can be parsed, edited or manipulated locally. Data can be updated granularly if an entity changes. The view of the data can be changed locally without re-transmitting data from the server. This improves server scalability because the server’s workload is far lower. Due to the repetitive nature of the tags, XML compresses extremely well, which reduces the bandwidth needed to send data.
  • Applications where different views of the data are required. After delivery from the server, data can be viewed in different ways determined by configuration, user preferences or other criteria. Many style sheets can be defined for one document, or many documents can share one style sheet.
  • Applications that need more meaningful searches. XML provides a flexible way to mark up data and therefore opens up the ability to do more intelligent searches. Instead of looking at whether and how certain words appear in a document, a search can be restricted to a specific field of a document. For example, a plain search for ‘Clinton’ also brings up ‘Clinton, South Carolina’. A more meaningful search would look for Clinton only under a president/people field.
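The difference between a plain word search and a field-restricted search can be sketched in a few lines. The example below is only an illustration: it uses Python's standard xml.etree.ElementTree module, and the documents and the element names (article, president, city) are hypothetical, not part of any agreed vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents; the element names are assumptions made for this sketch.
DOCS = [
    "<article><president>Clinton</president><city>Washington</city></article>",
    "<article><president>Lincoln</president><city>Clinton, South Carolina</city></article>",
]

def full_text_hits(docs, word):
    """Return the indexes of documents that mention the word anywhere."""
    return [i for i, doc in enumerate(docs)
            if word in "".join(ET.fromstring(doc).itertext())]

def field_hits(docs, field, word):
    """Return the indexes of documents where the word occurs inside the named element."""
    hits = []
    for i, doc in enumerate(docs):
        root = ET.fromstring(doc)
        if any(word in (el.text or "") for el in root.iter(field)):
            hits.append(i)
    return hits
```

Here full_text_hits(DOCS, "Clinton") matches both documents, while field_hits(DOCS, "president", "Clinton") matches only the first one.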

XML Vocabularies

Several XML vocabularies have been defined and some have been agreed upon. Vocabularies can be defined without special browser or plug-in support, which makes them very powerful. A few examples of vocabularies:

  • CDF (Channel Definition Format). Describes push technology. Can be used to describe how pages should be downloaded and usage of web pages.
  • OSD (Open Software Description). Describes software packages and their dependencies. Can be used to notify users of new versions of software components and to describe how to install them over the Internet.
  • OFX (Open Financial eXchange). Describes transfer of financial data. Can be used in communication with financial institutions over the Internet.
  • EDI (Electronic Data Interchange). Describes how data should be exchanged, regardless of the computing systems or accounting applications being used.

XML structure

Figure 3: XML structure

Logical Structure

A big benefit of XML is the ability to create a document whose elements and attributes can be managed and validated. The degree to which an element’s content is organized into child elements is called granularity. ‘Fine granularity’ means that an element has many descendants, as opposed to ‘coarse granularity’. Some hierarchical structures may be recursive, for example a list. An element that directly or indirectly contains instances of itself is called a nested element.

It is possible to pre-define which elements are allowed within other elements, defining the rules and structural relationship of the document. An optional DTD contains rules for each element allowed within a specific class of documents.

An element consists of a start-tag, data and an end-tag, in that order. Every tag must have a name. An element may contain further, embedded elements in its data. A ‘well-formed’ document consists of properly nested elements and correct syntax. A ‘valid’ document also has to conform to a DTD. All valid documents are well-formed documents.

It is possible for an element to hold information about its content beyond just its name. This ‘information about information’ is termed meta-data, and is stored in an attribute. An attribute consists of a name and a value, separated by ‘=’. When a DTD is not in use, the attribute value is simply considered to be a unit of text.
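As a small illustration of the parts of an element, the sketch below parses one hypothetical element (price, with a currency attribute) using Python's standard xml.etree.ElementTree module. The element and attribute names are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical element: a start-tag carrying one attribute, then data, then an end-tag.
element = ET.fromstring('<price currency="USD">19.95</price>')

name = element.tag      # the element name taken from the tags
data = element.text     # the character data between start-tag and end-tag
meta = element.attrib   # the attributes as a name -> value mapping

# Without a DTD every value is plain text, so any typing is left to the application.
amount = float(data)
```

The attribute value arrives as the text 'USD' and the content as the text '19.95'; converting the content to a number is the application's own decision.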

Figure 4: XML namespaces

Since the XML standard allows a document to conform to only one DTD, namespaces are used to ensure that element names from different sources do not conflict and to make their origins clear.
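A minimal sketch of how namespaces keep two elements with the same name apart is shown below. It relies on the namespace handling of Python's standard xml.etree.ElementTree module; the namespace URIs and the prefixes book and person are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two vocabularies both define a 'title' element; the URIs are made up for this sketch.
DOC = """
<order xmlns:book="http://example.org/book"
       xmlns:person="http://example.org/person">
  <book:title>The XML Companion</book:title>
  <person:title>Dr.</person:title>
</order>
"""

# Prefix -> URI mapping used for lookups; the prefixes are local shorthand only.
NS = {"book": "http://example.org/book", "person": "http://example.org/person"}

root = ET.fromstring(DOC)
book_title = root.find("book:title", NS).text
person_title = root.find("person:title", NS).text

# Internally the parser qualifies each name with its namespace URI.
qualified_name = root.find("book:title", NS).tag
```

The two title elements stay distinct because each is identified by its namespace URI together with its local name.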

Physical Structure

An XML document consists of entities. An entity is a unit of data such as a file, a database record or another data item. That means that document data can be distributed, which favors data re-use both internally and externally. An entity consists of a declaration and a value, except for the document entity and parameter entity.
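Re-use through entities can be illustrated with an internal entity that is declared once and referenced twice. This is only a sketch, assuming Python's standard xml.etree.ElementTree module, which expands entities declared in the document's internal subset; the entity name company and its value are hypothetical.

```python
import xml.etree.ElementTree as ET

# The entity 'company' is declared once (declaration + value) and referenced twice.
DOC = """<!DOCTYPE memo [
  <!ENTITY company "Example Corp">
]>
<memo><from>&company;</from><footer>Copyright &company;</footer></memo>"""

root = ET.fromstring(DOC)
sender = root.find("from").text    # entity reference replaced by its value
footer = root.find("footer").text  # same value re-used in a second place
```

Changing the entity value in the declaration changes every place where it is referenced.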

XML syntax

Figure 5: Example of an XML document syntax

XML resembles HTML, with a few exceptions:

  • Every tag must be closed.
  • No overlapping elements.
  • XML is a case-sensitive language.
  • An empty tag is delimited using ‘<’ and ‘/>’.
  • Allows an unlimited set of tags, which describe the data, not the display.
  • XML does not ignore white space.
  • All attribute values must be in quotes.
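Several of the rules above can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module as a stand-in; it tests only well-formedness, not validity against a DTD, and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# One correct document followed by violations of the rules listed above.
SAMPLES = {
    "valid document":       '<note id="1"><to>Ann</to><br/></note>',
    "unclosed tag":         "<note><to>Ann</note>",
    "overlapping elements": "<b><i>text</b></i>",
    "case mismatch":        "<Note>text</note>",
    "unquoted attribute":   "<note id=1>text</note>",
}

results = {label: is_well_formed(doc) for label, doc in SAMPLES.items()}
```

Only the first sample is accepted. An HTML browser would typically render all five without complaint, which is the main practical difference between the two languages.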

Database Solutions using XML

Now we know what XML is. The question is how it should be used in connection with databases. We will consider four different storage solutions and how data can be delivered using them.

As references I used SIM’s discussion of different databases[vii], POET’s description of their OO database[viii] and the survey of data delivery in Resistance is futile: The Web will assimilate your Database[ix].

Flat file Solution

Figure 6: Flat file solution

The file system is not designed to maintain links and hierarchical structure and is therefore not suitable for complex XML documents, though it is good for HTML. Linking is based on location, so link management is hard, which results in lost integrity. It is hard to query the documents, since the file system stores very little meta-data (file name, extension, creation date, size). Concurrency is low because locking is per file, so only one user at a time can edit a document.

Relational Solutions

A typical relational database uses SQL as its query language. A relational database can store the data in different ways than a file system:

  • As a BLOB (Binary Large Object) without distinguishing between different elements.
  • Divided into different columns, with the data reassembled for retrieval.
  • As a BLOB from which elements have been extracted to build an index.

The downside of storing XML documents as BLOBs is that it prevents linking of individual elements. If the document is divided into columns, it is hard to create a flexible and scalable locking scheme in a relational database. If we extract elements to create an index, we gain the means of querying the data even though we store it as BLOBs.
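The third option, a BLOB plus an index of extracted elements, can be sketched as follows. This is an illustration under stated assumptions, not a description of any vendor's product: it uses Python's standard sqlite3 module as a stand-in relational database, and the table names (documents, element_index) and the sample documents are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# The whole document is kept as an opaque BLOB ...
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body BLOB)")
# ... and selected element contents are extracted into a queryable index table.
conn.execute("CREATE TABLE element_index (doc_id INTEGER, tag TEXT, value TEXT)")

def store(doc_id, xml_text):
    """Store the document as a BLOB and index the text of each element."""
    conn.execute("INSERT INTO documents VALUES (?, ?)",
                 (doc_id, xml_text.encode("utf-8")))
    for el in ET.fromstring(xml_text).iter():
        if el.text and el.text.strip():
            conn.execute("INSERT INTO element_index VALUES (?, ?, ?)",
                         (doc_id, el.tag, el.text.strip()))

def find_docs(tag, value):
    """Use the index to locate documents, then return the stored BLOBs as text."""
    rows = conn.execute(
        "SELECT d.body FROM documents d "
        "JOIN element_index i ON i.doc_id = d.id "
        "WHERE i.tag = ? AND i.value = ? ORDER BY d.id", (tag, value))
    return [body.decode("utf-8") for (body,) in rows]

store(1, "<book><title>XML Basics</title><author>Lee</author></book>")
store(2, "<book><title>SQL Basics</title><author>Kim</author></book>")
```

A call such as find_docs("author", "Kim") returns the second document unchanged. The price is that element text is stored twice, once in the BLOB and once in the index.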

Figure 7: Relational Middle-ware

SQL provides only very primitive functionality for searching in a text document. The SQL operator LIKE can search for a specific sub-string. One solution to this problem is to create a word-table as an index, holding all words in the texts and a foreign key to the texts they appear in. Then we can use the SQL JOIN operator to match the word-table against the texts. This solution is not very scalable: with many concurrent users and a big join query, the performance is not acceptable, whereas XML is trying to move all or some of this processing from the server to the client. This solution also needs more storage space, since we need to keep each word at least twice. Management is also a problem, since the word-table has to be updated on every interactive update. It also doesn’t support word distance operators (like the search operator NEAR) unless the word-table is enhanced, which results in more complex queries.
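The word-table approach can be sketched with two tables and a join. The example assumes Python's standard sqlite3 module as a stand-in relational database; the table names (texts, words) and the sample texts are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE texts (id INTEGER PRIMARY KEY, body TEXT)")
# Word-table: one row per distinct word per text, with a foreign key to the text.
conn.execute("CREATE TABLE words (word TEXT, text_id INTEGER REFERENCES texts(id))")

def add_text(text_id, body):
    """Store the text and index each of its words (so every word is kept twice)."""
    conn.execute("INSERT INTO texts VALUES (?, ?)", (text_id, body))
    for word in set(body.lower().split()):
        conn.execute("INSERT INTO words VALUES (?, ?)", (word, text_id))

def search(word):
    """Whole-word search: join the word-table against the texts."""
    rows = conn.execute(
        "SELECT t.id FROM words w JOIN texts t ON t.id = w.text_id "
        "WHERE w.word = ? ORDER BY t.id", (word.lower(),))
    return [text_id for (text_id,) in rows]

def like_search(fragment):
    """Sub-string search with LIKE, which has to scan every text body."""
    rows = conn.execute(
        "SELECT id FROM texts WHERE body LIKE ? ORDER BY id", (f"%{fragment}%",))
    return [text_id for (text_id,) in rows]

add_text(1, "XML separates data from presentation")
add_text(2, "SQL queries relational data")
add_text(3, "HTML describes presentation")
```

Note that this word-table records only which words occur in which text, not where they occur, which is why a distance operator such as NEAR cannot be answered from it without adding word positions.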

By creating middle-ware for text processing we can use a three-tier approach. Then the application doesn’t need to know about the text processing. Different implementation schemes can be used with a three-tier approach, which adds more flexibility. For example, the word-table could be stored in the database or in the text engine. However, it is harder to utilize the database management system, since the middle-ware takes care of interpreting and processing the text queries. This solution, like the previous one, needs more storage space, since we need to keep each word at least twice.

An example of a template solution that works on the server side is Cold Fusion, which works as a gateway between a database and an HTTP server. A document author creates an extended HTML document that contains database access requests and formatting rules. Other server-side options are, for example, ASP (Active Server Pages), Java Servlets and CGI. An example of a client-side solution would be a Java program communicating with the database through JDBC. However, XML is a much easier solution. By converting data from the database to XML we can use a common interface to query the data. Other search engines can also index the data in a meaningful manner.

To query the database directly, some databases have defined a database URL (Uniform Resource Locator) query. An example is the Oracle web server, which queries the database over the HTTP protocol. JDBC (Java Database Connectivity) is an API for accessing relational databases, but since there is no standard communication protocol between a client and a relational database, JDBC’s protocol isn’t supported by all relational databases.

The database vendor could also extend the SQL syntax with operators for text support and indexing. That, however, makes it hard to move the application to another relational database. This solution will also not be as effective for structured documents like XML.
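Converting relational data to XML can be sketched in a few lines. The example below is an assumption-laden illustration, not the method of any product named above: it uses Python's standard sqlite3 and xml.etree.ElementTree modules, and the function table_to_xml, the books table and its columns are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

def table_to_xml(conn, table, row_tag):
    """Render every row of a table as one XML element, one child element per column.

    The table name is interpolated into the SQL text, so this sketch is only
    suitable for trusted, hard-coded table names.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [col[0] for col in cursor.description]
    root = ET.Element(table)
    for row in cursor:
        row_el = ET.SubElement(root, row_tag)
        for column, value in zip(columns, row):
            ET.SubElement(row_el, column).text = str(value)
    return ET.tostring(root, encoding="unicode")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, year INTEGER)")
conn.execute("INSERT INTO books VALUES ('The XML Companion', 1998)")

xml_text = table_to_xml(conn, "books", "book")
```

Once the data is in this form, a client or a search engine needs to understand only XML, not the database's own access protocol.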

Object Oriented Solution

Figure 8: OO database

OODBs (object-oriented databases) are designed to store objects in their native form. They are designed to handle arbitrary, variable-length data types and interrelated data, which is what an XML document consists of. They should therefore be able to index the data and access it in a reasonable way. Since there are many concurrent users in electronic publishing with different needs, “on-the-fly” publishing is often required. If XML is stored in its native form, publishing should be much faster.

Hybrid Solution

Figure 9: Textual database

We will consider a textual database that stores XML as a field type. Data stored as XML documents is ready for delivery. Other meta-data, like the name of the original file, can be stored with the object. If the physical and logical views are separated, the logical view can work as an index, giving the data fine granularity. Index structures that support text operators like NEAR can be stored in the logical view. With a hybrid database it is hard to move the application to another kind of database. However, since it is optimized to store XML, it should perform better.
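One way an index can support a distance operator like NEAR is to record the position of every word. The sketch below is a generic positional index written for illustration; it is not taken from any of the products discussed, and the function names build_positions and near are hypothetical.

```python
from collections import defaultdict

def build_positions(text):
    """Map each word to the list of word offsets at which it occurs."""
    positions = defaultdict(list)
    for offset, word in enumerate(text.lower().split()):
        positions[word].append(offset)
    return positions

def near(positions, first, second, distance):
    """Return True if the two words occur within `distance` words of each other."""
    return any(abs(a - b) <= distance
               for a in positions.get(first.lower(), [])
               for b in positions.get(second.lower(), []))

index = build_positions("XML will bring structure to the chaos of the Internet")
```

With this index, near(index, "structure", "chaos", 3) holds because the words are three positions apart, while near(index, "XML", "Internet", 3) does not.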

Virtual database

Figure 10: Virtual database

A discussion about XML and databases cannot end without mentioning virtual databases. Virtual databases (VDB) are a very popular research topic and have been implemented in several ways (Junglee, Strudel). A virtual database system consists of two things: a mediator/metasearcher and wrappers. A wrapper creates a common query interface to a database. The mediator/metasearcher takes care of selecting the databases to query, translating the query and combining the query results. A lot of work has already been done in defining different query algebras.[x][xi]

This technology has the advantage that XML files can be stored in different formats and still be queried through a single interface.
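The division of labor between wrappers and mediator can be sketched as below. This is a deliberately simplified illustration and not how Junglee or Strudel are implemented: the class names are hypothetical, the "query" is a single field/value match, and the mediator simply asks every source instead of selecting among them. It uses Python's standard sqlite3 and xml.etree.ElementTree modules.

```python
import sqlite3
import xml.etree.ElementTree as ET

class XmlWrapper:
    """Wrapper over an XML document whose root holds one child element per record."""
    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)

    def query(self, field, value):
        return [{child.tag: child.text for child in record}
                for record in self.root
                if record.findtext(field) == value]

class SqlWrapper:
    """Wrapper over a relational table (trusted table and column names only)."""
    def __init__(self, conn, table):
        self.conn, self.table = conn, table

    def query(self, field, value):
        cursor = self.conn.execute(
            f"SELECT * FROM {self.table} WHERE {field} = ?", (value,))
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor]

class Mediator:
    """Sends the same query to every wrapper and combines the results."""
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, field, value):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(field, value))
        return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, author TEXT)")
conn.execute("INSERT INTO books VALUES ('SQL Basics', 'Lee')")
conn.execute("INSERT INTO books VALUES ('Other Topics', 'Kim')")

XML_SOURCE = "<books><book><title>XML Basics</title><author>Lee</author></book></books>"
mediator = Mediator([XmlWrapper(XML_SOURCE), SqlWrapper(conn, "books")])
hits = mediator.query("author", "Lee")
```

The caller sees one interface and one combined result list, regardless of whether a record came from the XML source or from the relational table.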

Conclusions

XML will bring structure to the chaos of the Internet. If it becomes widely used, it will revolutionize how data is indexed, stored and validated. Users can define their own standards, which means they don’t have to wait until standards get approved. That will help users move over to XML. With XLL, multiple copies of the same document can be managed in a reasonable manner.

Web sites grow fast, so even where a flat file structure is good enough today, other options should be kept in mind. It is easier to maintain only one database. Therefore, if a company already has a relational database, there are solutions that support continued use of it. One of them is to transform the database content into XML and use granular updates to update the XML documents. When starting from scratch, an object-oriented database would be the most feasible solution, since it can store XML documents in their native format. Hybrid solutions can be used for performance-critical data.

A virtual database can be used to query different data sources, so keeping XML documents both outside and inside the database should not create a problem. In the future we can expect more intelligent searches from search engines as more users tag their documents in XML.

References

[i] DocBase - A database environment for structured documents, Arijit Sengupta, December 1997

[ii] XML: Enabling Next-Generation Web Applications

[iii] OASIS: The SGML/XML web page, - xml-osd

[iv] XML: Extensible Markup Language, Elliotte Rusty Harold, 1998

[v] The XML companion, Neil Bradley, 1998

[vi] XML, Java, and the future of the Web, Jon Bosak, March 1997

[vii] SIM versus Relational Technology

[viii] XML – The Foundation for the Future

[ix] Various articles from Data Engineering vol. 21 no. 2, June 1998

[x] Various articles from QL ’98

[xi] XML-QL: A Query Language for XML