Storage of Native XML

Carroll, Gibbons

February 19, 2006

"XML is the de-facto standard for exchanging data between different systems, platforms, applications, and organizations" [3]. Because of its file structure, XML provides an elegant way to transmit data whose shape changes dynamically, with both data and data types being changed or added on the fly. Due in part to this structure, entities of the same kind can carry different descriptive data within a single file, e.g. one record may hold a single value where another holds several.

As such, many database vendors have introduced DBMSs that are dedicated to the storage and retrieval of XML in its native format. “In a native XML database approach, XML is not fragmented but rather is stored as a whole in the native XML database. This means that documents are stored, indexed, and retrieved in their original format, with all their content, tags, attributes, entity references, and ordering preserved” [1]. Large database vendors, such as Oracle and IBM, have been pressured to add similar features to their own products in order to cater to the desires of some customers. Although some may argue that these add-on features do not perform nearly as well as their dedicated counterparts, we maintain that the functionality provided by Oracle and IBM is sufficient given the limited usefulness of native XML storage.

As is evident in [5], some within the industry advocate the use of XML databases to house the information for an entire SOA (Service-Oriented Architecture) system. They contend that since the information is being transmitted as XML, storing it in the same form is advantageous. Doing so allows changes to be made to the message without impacting the backend datastores. It also reduces the amount of coding, because with a traditional relational database “… you will have to write code that knows, for every message type, how to take the incoming message, shred it, and populate the tables” [5]. Yet another argument for the use of an XML database is the ability to transform the data, via XSLT or XQuery, into a multitude of different formats. This allows multiple user interfaces to access the same XML document but display the content in different ways without developing application logic for each format [5].
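The shredding step described in [5] can be sketched as follows. This is a minimal illustration only: the message layout, the element and attribute names, the table definitions, and the helper shred_account are all assumptions made for the example, and SQLite stands in for whichever relational DBMS is actually used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical incoming message; the element names are illustrative only.
MESSAGE = """<account number="1001" type="checking">
  <customer id="C42">
    <name>Pat Example</name>
    <phone>555-0100</phone>
  </customer>
</account>"""


def shred_account(conn, xml_text):
    """Parse one account message and populate the relational tables."""
    root = ET.fromstring(xml_text)
    cust = root.find("customer")
    # A mapping like this has to be written and maintained per message type.
    conn.execute(
        "INSERT OR REPLACE INTO customer (id, name, phone) VALUES (?, ?, ?)",
        (cust.get("id"), cust.findtext("name"), cust.findtext("phone")),
    )
    conn.execute(
        "INSERT INTO account (number, type, customer_id) VALUES (?, ?, ?)",
        (root.get("number"), root.get("type"), cust.get("id")),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, name TEXT, phone TEXT)")
conn.execute(
    "CREATE TABLE account (number TEXT PRIMARY KEY, type TEXT, "
    "customer_id TEXT REFERENCES customer(id))"
)
shred_account(conn, MESSAGE)

# Once shredded, the data is available to any SQL-capable application.
rows = conn.execute(
    "SELECT a.number, c.name FROM account a "
    "JOIN customer c ON c.id = a.customer_id"
).fetchall()
print(rows)
```

The mapping inside shred_account is the code that advocates of native XML storage would like to avoid writing; a change to the message layout would require a matching change here and possibly in the table definitions.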

We contend that native XML is just a new twist on the old hierarchical databases, in that it suffers from the same problems that relational databases were created to resolve: the inability to support many-to-many relationships, which results in the same data being stored multiple times in the database, and the lack of referential integrity constraints. To illustrate this point, picture a bank account that is represented in XML format. There will be several nodes within the document, one of which represents the account itself, containing items such as account number, type, etc. Another node will represent the customer and include name, address, phone, etc. Now, if the same customer opens another account, that information will be duplicated in another document. This increases the hardware resources needed to store the information, and it increases the likelihood that customer information stored in account “A” will fall out of sync with the information in account “B”. Also, most XML databases do not enforce referential integrity, and therefore the responsibility is passed on to the application [1].
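The bank account scenario can be made concrete with a small sketch. The two documents, their element names, and the phone numbers below are hypothetical; the point is only that an update applied to one document leaves the embedded copy in the other untouched.

```python
import xml.etree.ElementTree as ET

# Two hypothetical account documents that embed the same customer.
ACCOUNT_A = (
    '<account number="1001"><customer id="C42">'
    "<name>Pat Example</name><phone>555-0100</phone>"
    "</customer></account>"
)
ACCOUNT_B = (
    '<account number="2002"><customer id="C42">'
    "<name>Pat Example</name><phone>555-0100</phone>"
    "</customer></account>"
)

docs = {"A": ET.fromstring(ACCOUNT_A), "B": ET.fromstring(ACCOUNT_B)}

# The customer reports a new phone number, but the application only
# updates the document it happens to have open (account "A").
docs["A"].find("customer/phone").text = "555-0199"

# Nothing in the storage layer ties the two customer copies together,
# so the second document silently keeps the stale value.
phones = {name: doc.findtext("customer/phone") for name, doc in docs.items()}
out_of_sync = len(set(phones.values())) > 1
print(phones, out_of_sync)
```

In a normalized relational design the customer would be a single row referenced by both accounts, so the same update would be made in exactly one place.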

In addition, legacy applications have difficulty accessing and modifying data stored in this format, both because of the limitations of legacy programming languages and because the analysts who maintain those applications often lack XML skills. Resolving this issue will require either additional training for legacy programmers or storage of the data in two formats, as proposed in [5], with a complex replication process to keep both copies in sync with each other.

Another problem that arises with native XML databases is the cost of parsing the data. Parsing can increase query execution time many times over. Evidence from [2] also shows that parsing requires a very large amount of CPU utilization. The authors' experience “…working with companies which have introduced or are prototyping XML database applications, shows that XML parsing recurs as a major bottleneck and is often the single biggest performance concern seriously threatening the overall success of the project”. By placing the data into a relational format, these parsing issues may be contained. Although one might argue that the parsing cost is merely passed from the DBMS to the application, we believe that the relational representation of the data is usable by a larger audience and thus the parsing effort is actually reduced across the enterprise.

Native XML is, however, useful in a couple of limited cases. One such case is when data is used solely by a single application, presumably by a small sector of the company, with little duplication of data from document to document. The data manipulated by this application would not be considered enterprise data, and therefore the expense required to build a robust solution for a limited number of users would not be justified.

Another good use of XML is for storage of "work in progress" data. For example, in the bank account situation mentioned above, the application process may be stretched over a period of days. Continuously parsing the XML into relational tables and then recreating it for use would not be efficient given the limited life of the data itself. If the application that is managing this data is capable of easily handling XML documents, it is permissible to maintain the data in this format until it is ready to be released to the enterprise. At that point, the data needs to be broken into relational tables in order to be consumed by all applications.

As a data exchange standard, XML is subject to the same requirements as the previous standard, EDI. Therefore, there might be a need to reproduce the original message. Keeping the data in XML format preserves all of it exactly as it was first received, whereas translating the XML document into tables loses some of the data, even if it is only the structure of the data. The same information can be gathered through queries either way, but the form of the data changes, which can be important in legal cases where exactness is required [3]. That being said, we feel that if the message must be produced in its original format, there is nothing to prevent data in XML format from being kept in a character field, on disk, or even on tape backup. We do not believe that native storage is a requirement in this case. One exception, however, would be a need to search the content of the XML documents. In this case, storage as native XML would be necessary in order to use XML query languages such as XQuery.
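A minimal sketch of the character-field approach is shown below. The table name message_archive, the sample order message, and the helper find_by_sku are assumptions made for illustration, and SQLite again stands in for a general relational DBMS. The stored text comes back unchanged, but searching by content means parsing every document in the application, which is the gap that native XML storage and XQuery are meant to fill.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical message, kept exactly as received (comment included).
ORIGINAL = '<order id="7"><!-- as received --><item sku="X1" qty="2"/></order>'

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message_archive (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO message_archive (body) VALUES (?)", (ORIGINAL,))

# Reproducing the message: the character field returns the text unchanged.
(stored,) = conn.execute(
    "SELECT body FROM message_archive WHERE id = 1"
).fetchone()
unchanged = stored == ORIGINAL


def find_by_sku(conn, sku):
    """Search archived messages by content; each one must be parsed."""
    hits = []
    for row_id, body in conn.execute("SELECT id, body FROM message_archive"):
        root = ET.fromstring(body)
        if any(item.get("sku") == sku for item in root.iter("item")):
            hits.append(row_id)
    return hits


matches = find_by_sku(conn, "X1")
print(unchanged, matches)
```

The full scan in find_by_sku is acceptable for occasional retrieval of an archived message but would not scale to frequent content searches.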

Additionally, some types of data, such as those from the life sciences, are better suited to the tree-based format that XML provides than to the tables of DB2 [3]. Since these are special cases, special arrangements can be made. This does not, however, mean that data that does not inherently follow a tree structure gains any advantage from being stored in one.

Although there are some advantages to keeping data stored in a native XML format, we believe that the difficulties inherent in hierarchical structures, the integration with legacy systems, and the excess use of CPU far outweigh the benefits in most situations. Therefore, we recommend the continued use of DBMSs in which native XML storage has been integrated with the existing relational capabilities over DBMSs that store data solely as XML, as the latter cannot meet the needs of the entire enterprise.

We do, however, caution against the mixing of relational and XML storage within the same application, especially if there is a need to query both. Although DB2 does allow for the mixing of SQL and XQuery [4], this introduces additional complexity and the need to be fluent in both languages in order to produce optimal queries.

References

[1] Shalaka Natu, John Mendonca, “Digital Asset Management Using A Native XML Database Implementation”, CITC4 ’03, October 16-18, 2003, Lafayette, Indiana, USA.

[2] Matthias Nicola, Jasmi John, “XML Parsing: A Threat to Database Performance”, CIKM ’03, November 3-8, 2003, New Orleans, Louisiana, USA.

[3] Matthias Nicola, Bert van der Linden, “Native XML Support in DB2 Universal Database”, Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005.

[4] “Native XML data store overview”,

[5] “Use XML databases to empower Java Web services”,