Storing XML in ORDBMS

By Amine Kaddara

Supervisor: Dr Hachim Haddouti

Introduction

Object-Relational databases and XML

Object-Relational databases: Definition
Storing XML in ORDBMS: Motivation

Mapping XML to ORDBMS

Introduction

Mapping Schemas
Mapping DTDs to Object Schemas
Mapping Object Schemas to Relational Database Schemas

Mapping Complex Content Models
Mapping Sequences and Choices
Mapping Repeated Children
Mapping Subgroups
Mapping Single-Valued and Multi-Valued Attributes
Mapping ID/IDREF(S) Attributes

Generating Schema
Generating Relational Database Schema from DTDs
Generating DTDs from Database Schema

Related Technologies

Querying XML in ORDBMS
Java DOM
Java Data Objects

Introduction

In this paper, I discuss a storage system for XML based on an object-relational DBMS. I first introduce object-relational database technology and then explain the motivations for storing XML in object-relational database management systems. Next, I analyze the mapping from XML document structures (DTDs in this case) to an object schema, and from the object schema to the object-relational database schema. Then, based on the structure of a DTD, I show how the mapping is applied to the different components of the DTD. The final part introduces the JDOM (Java DOM) and JDO (Java Data Objects) APIs.

Object-Relational databases and XML

a. Object-Relational databases: Definition

Object-relational database management systems combine object and relational technology. These systems put an object-oriented front end on a relational database (RDBMS). Programs written in an object-oriented programming language interface with the database as if the data were stored as objects. However, the system converts these objects into tables, rows, and columns, and then handles the data in the same way as a relational database does.

When data is retrieved, it must be reassembled from simple data into complex objects. The main benefit of this type of database is that the software that converts object data between the RDBMS format and the object format is provided. Programmers therefore do not need to write code to convert between the two formats, and database access is easy from an object-oriented language.

b. Storing XML in ORDBMS: Motivation

It is widely accepted that XML will be the standard for documents with structural information on the Web. The number of documents, and of applications that need to manipulate large sets of data, is growing, so XML documents must be managed and stored efficiently. Several types of XML storage systems exist, but most of them use relational DBMSs. Storing XML documents as database records requires a specification of the mapping from the document structures to the database schema. Most commercial DBMSs provide such specification languages, but the languages are proprietary and limited to specifying a mapping to relational databases only. Database vendors today offer hybrid systems that combine relational and object-relational technology in the same product. Another important aspect of ORDBMSs is that they allow a more expressive type system, which fits the purpose of XML: user-defined tags that represent real-world entities.

Mapping XML to ORDBMS

Introduction

The most important part of storing XML is how to map the XML model to the object-relational database model. The object-relational mapping strategy models the data in XML documents rather than the documents themselves. As a consequence, this kind of mapping is better suited to data-centric documents. Another important characteristic of this mapping is that it is bidirectional: it can be used to transfer data both from XML documents to the database and from the database to XML documents. As a result, it can serve as a canonical mapping over which XML query languages are built for non-XML databases. A canonical mapping defines virtual XML documents that can be queried with a language such as XQuery. The mapping also supports data binding, that is, the marshalling and unmarshalling of data between XML documents and objects.

Mapping Schemas

Mapping DTDs to Object Schemas

The mapping relies on some conventions that draw an analogy between XML data types and the data types of an object-oriented programming language:

Simple element types and attributes are mapped to scalar data types (single-valued data types).

Complex element types are mapped to classes, with each element type in the content model of the complex type mapped to a property of that class.

References to complex element types are mapped to pointers/references to an object of the class to which the complex element type is mapped. The data type of each property is the data type to which the referenced element type is mapped.

Attributes are mapped to properties, with the data type of the property determined from the data type of the attribute.

Example:

DTD                                Classes
===                                =======
<!ELEMENT A (B, C)>                class A {
<!ELEMENT B (#PCDATA)>                 String b;
<!ATTLIST A                   ==>      C c;
          F CDATA #REQUIRED>           String f;
                                   }

<!ELEMENT C (D, E)>                class C {
<!ELEMENT D (#PCDATA)>        ==>      String d;
<!ELEMENT E (#PCDATA)>                 String e;
                                   }

Simple element types B, D, and E and the attribute F are all mapped to Strings (other data types are possible if the programmer changes them explicitly or if an XML Schema is used instead of a DTD).

Complex element types A and C are mapped to classes A and C.

The content models and attributes of A and C are mapped to properties of classes A and C.

The reference to C in the content model of A is mapped to a property with the type pointer/reference to an object of class C because element type C is mapped to class C.

Note: if an element type is referenced in two different content models, each reference must be mapped separately.
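The class mapping above, together with the marshalling and unmarshalling it enables, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Python is used for brevity although the notation above is Java-like, and the sample document and the helper names `unmarshal_a` and `marshal_a` are hypothetical, not part of the mapping itself.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class C:
    d: str  # from <!ELEMENT D (#PCDATA)>
    e: str  # from <!ELEMENT E (#PCDATA)>


@dataclass
class A:
    b: str  # from <!ELEMENT B (#PCDATA)>
    c: C    # reference to an object of class C
    f: str  # from attribute F (CDATA #REQUIRED)


def unmarshal_a(xml_text: str) -> A:
    """Build an A object from an <A> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    c_elem = root.find("C")
    return A(
        b=root.findtext("B"),
        c=C(d=c_elem.findtext("D"), e=c_elem.findtext("E")),
        f=root.get("F"),
    )


def marshal_a(a: A) -> str:
    """Serialize an A object back to XML, keeping the (B, C) sequence."""
    root = ET.Element("A", {"F": a.f})
    ET.SubElement(root, "B").text = a.b
    c_elem = ET.SubElement(root, "C")
    ET.SubElement(c_elem, "D").text = a.c.d
    ET.SubElement(c_elem, "E").text = a.c.e
    return ET.tostring(root, encoding="unicode")


# Made-up sample document that is valid against the DTD above.
sample = '<A F="fff"><B>bbb</B><C><D>ddd</D><E>eee</E></C></A>'
obj = unmarshal_a(sample)
print(obj)
print(marshal_a(obj))
```

Because the mapping is bidirectional, marshalling the object again yields the same document that was unmarshalled.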

Mapping Object Schemas to Relational Database Schemas

The second step of object relational mapping is to map the object schema to the database schema. The mapping involves the following steps:

Classes are mapped to tables (known as class tables).

Scalar properties are mapped to columns.

Pointer/reference properties are mapped to primary key/foreign key relationships.

If the relationship between the parent and child elements is one-to-one, the primary key can be in either table.

If the relationship is one-to-many, the primary key must be on the "one" side of the relationship, regardless of whether this is the parent or the child.

A primary key column can be created as part of the mapping.

If a primary key column is created as part of the mapping, its value must be generated by the database.

Example:

Classes                       Tables
=======                       ======
class A {                     Table A:
    String b;                     Column b
    C c;                 ==>      Column c_fk
    String f;                     Column f
}

class C {                     Table C:
    String d;            ==>      Column d
    String e;                     Column e
}                                 Column c_pk

The tables are joined by a primary key (C.c_pk) and a foreign key (A.c_fk).

Note: Names can be changed during the mapping; the DTD, the object schema, and the relational schema can all use different names. For example:

DTD: <!ELEMENT Part (Number, Price)>  ==>  Class name: class PartClass

Class name: class PartClass  ==>  Table name: Table PRT

Also, the objects involved in the mapping are conceptual. That is, there is no need to instantiate them when transferring data between an XML document and a relational database.
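The following sketch illustrates the last point: data is copied from an XML document straight into tables A and C without instantiating any objects. It is a minimal illustration, assuming Python with the standard `sqlite3` and `xml.etree.ElementTree` modules; the sample document, the SQL column types, and the helper `store_a` are assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE C (
        c_pk INTEGER PRIMARY KEY,   -- value generated by the database
        d    TEXT NOT NULL,
        e    TEXT NOT NULL
    );
    CREATE TABLE A (
        b    TEXT NOT NULL,
        c_fk INTEGER NOT NULL REFERENCES C(c_pk),
        f    TEXT NOT NULL
    );
""")


def store_a(xml_text: str) -> None:
    """Copy one <A> document into tables A and C (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    c_elem = root.find("C")
    # Insert the child row first so that its generated key is known.
    cur = conn.execute(
        "INSERT INTO C (d, e) VALUES (?, ?)",
        (c_elem.findtext("D"), c_elem.findtext("E")),
    )
    conn.execute(
        "INSERT INTO A (b, c_fk, f) VALUES (?, ?, ?)",
        (root.findtext("B"), cur.lastrowid, root.get("F")),
    )


store_a('<A F="fff"><B>bbb</B><C><D>ddd</D><E>eee</E></C></A>')

# The two class tables are joined through C.c_pk and A.c_fk.
rows = conn.execute(
    "SELECT A.b, A.f, C.d, C.e FROM A JOIN C ON A.c_fk = C.c_pk"
).fetchall()
print(rows)
```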

Mapping Complex Content Models

i. Mapping Sequences and Choices

Each element type referenced in a sequence is mapped to a property, which is then mapped either to a column or to a primary key/foreign key relationship. Each element type referenced in a choice is also mapped to a property and then either to a column or to a primary key/foreign key relationship. The only difference from the way sequences are mapped is that the properties and columns can be null, because only one branch of a choice appears in a given element.

Example:

class A {                          Table A
    String b; // Nullable              Column b    // Nullable
    C c;      // Nullable              Column c_fk // Nullable
}
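As a small illustration of the nullable columns above, the sketch below creates Table A in SQLite from Python. The CHECK constraint, which accepts a row only when exactly one branch of the choice is filled in, is an assumption added for the example; the mapping itself only requires that both columns be nullable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE A (
        b    TEXT,      -- nullable, set only when the B branch is chosen
        c_fk INTEGER,   -- nullable, set only when the C branch is chosen
        -- Assumed extra constraint, exactly one branch must be present.
        CHECK ((b IS NULL) <> (c_fk IS NULL))
    )
""")

conn.execute("INSERT INTO A (b) VALUES ('text of B')")   # B branch
conn.execute("INSERT INTO A (c_fk) VALUES (1)")          # C branch

try:
    # Both branches at once contradict the choice and are rejected.
    conn.execute("INSERT INTO A (b, c_fk) VALUES ('both', 2)")
    both_rejected = False
except sqlite3.IntegrityError:
    both_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM A").fetchone()[0]
print(both_rejected, row_count)
```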

ii. Mapping Repeated Children

Repeated children are mapped to multi-valued properties and then either to multiple columns in a table or to a separate table, known as a property table.

If a content model contains repeated references to an element type, the references are mapped to a single property, which is an array of known size. That property can then be mapped either to multiple columns in a table or to a property table.

Children that are optional in their parent are mapped to nullable properties, then to nullable columns.

Example:

DTD                           Classes                         Tables
===                           =======                         ======
<!ELEMENT E (K, K, K)>        class E {                       Table E
<!ELEMENT K (#PCDATA)>   ==>      String[] k;            ==>      Column k1
                              }                                   Column k2
                                                                  Column k3

<!ELEMENT A (B+, C?)>         class A {                       Table A
<!ELEMENT B (#PCDATA)>   ==>      String[] b;            ==>      Column a_pk
<!ELEMENT C (#PCDATA)>            String c; // Nullable           Column c // Nullable
                              }
                                                              Table B
                                                                  Column a_fk
                                                                  Column b
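The property-table case (Table A with Table B) can be sketched as follows in Python with SQLite. This is an illustration only: the sample document and the helper `store_a` are hypothetical, and the `b_order` column is an assumption added so that the document order of the repeated B children can be restored.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (
        a_pk INTEGER PRIMARY KEY,
        c    TEXT                    -- nullable, the C child is optional
    );
    CREATE TABLE B (                 -- property table for the repeated B child
        a_fk    INTEGER NOT NULL REFERENCES A(a_pk),
        b       TEXT NOT NULL,
        b_order INTEGER NOT NULL     -- assumed order column
    );
""")


def store_a(xml_text):
    """Store one <A> element; each B child becomes a row in table B."""
    root = ET.fromstring(xml_text)
    # findtext returns None when C is absent, which is stored as NULL.
    cur = conn.execute("INSERT INTO A (c) VALUES (?)", (root.findtext("C"),))
    a_pk = cur.lastrowid
    for position, b_elem in enumerate(root.findall("B"), start=1):
        conn.execute(
            "INSERT INTO B (a_fk, b, b_order) VALUES (?, ?, ?)",
            (a_pk, b_elem.text, position),
        )
    return a_pk


a_pk = store_a("<A><B>b1</B><B>b2</B><B>b3</B></A>")
b_values = [row[0] for row in conn.execute(
    "SELECT b FROM B WHERE a_fk = ? ORDER BY b_order", (a_pk,))]
c_value = conn.execute("SELECT c FROM A WHERE a_pk = ?", (a_pk,)).fetchone()[0]
print(b_values, c_value)
```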

iii. Mapping Subgroups:

References in subgroups are mapped to properties of the parent class, then to columns in the class table.

Example:

<!ELEMENT A (B, (C | D))>       class A {                       Table A
<!ELEMENT B (#PCDATA)>     ==>      String b; // Not nullable       Column b    // Not nullable
<!ELEMENT C (#PCDATA)>              String c; // Nullable           Column c    // Nullable
<!ELEMENT D (E, F)>                 D d;      // Nullable           Column d_fk // Nullable
                                }

iv. Mapping Single-Valued and Multi-Valued Attributes

Single-valued attributes (CDATA, ID, IDREF, NMTOKEN, ENTITY, NOTATION, and enumerated) map to single-valued properties and then to columns.

Multi-valued attributes (IDREFS, NMTOKENS, and ENTITIES) map to multi-valued properties and then to property tables.

The order in which attributes occur is not significant, but the order in which values occur in multi-valued attributes is considered significant.

Example:

DTD                               Classes               Tables
===                               =======               ======
<!ELEMENT A (B, C)>               class A {             Table A
<!ATTLIST A                           String b;             Column B
          D CDATA #REQUIRED> ==>      String c;    ==>      Column C
<!ELEMENT B (#PCDATA)>                String d;             Column D
<!ELEMENT C (#PCDATA)>            }

and:

DTD                               Classes               Tables
===                               =======               ======
<!ELEMENT A (B, C)>               class A {             Table A
<!ATTLIST A                           String b;             Column a_pk
          D IDREFS #IMPLIED> ==>      String c;    ==>      Column b
<!ELEMENT B (#PCDATA)>                String[] d;           Column c
<!ELEMENT C (#PCDATA)>            }

                                                        Table D
                                                            Column a_fk
                                                            Column d
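Since the order of the values in a multi-valued attribute is significant, a property table such as Table D needs some way to record it. The Python sketch below shows one way to do so, assuming an extra position value per row; that position, the function names, and the sample value are assumptions for illustration and are not part of Table D as shown.

```python
def idrefs_to_rows(a_pk, value):
    """Split an IDREFS value into (a_fk, d, position) rows for table D."""
    return [(a_pk, token, position)
            for position, token in enumerate(value.split(), start=1)]


def rows_to_idrefs(rows):
    """Rebuild the attribute value, restoring the original token order."""
    ordered = sorted(rows, key=lambda row: row[2])
    return " ".join(token for _a_fk, token, _position in ordered)


rows = idrefs_to_rows(1, "id3 id1 id2")
print(rows)

# Rows may come back from the database in any order; the position value
# is what lets the original attribute value be reconstructed.
shuffled = [rows[2], rows[0], rows[1]]
restored = rows_to_idrefs(shuffled)
print(restored)
```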

v. Mapping ID/IDREF(S) Attributes

ID/IDREF(S) attributes map to primary key/foreign key relationships.

IDs need to be unique only inside a given XML document. Thus, if the data from more than one document is stored in the same table, there is no guarantee that the IDs will be unique. The solution is to change the ID by prefixing it, or to map the attribute to two columns, one of which contains a value that is unique to each document and the other of which contains the ID.
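Both solutions can be sketched briefly in Python with SQLite. The table `Part`, its columns, the document names, and the helper `prefixed_id` are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Part (
        doc_id TEXT NOT NULL,   -- value unique to each source document
        xml_id TEXT NOT NULL,   -- ID attribute as written in the document
        name   TEXT,
        PRIMARY KEY (doc_id, xml_id)
    )
""")


def prefixed_id(doc_id, xml_id):
    """Alternative solution: fold the document identifier into one value."""
    return f"{doc_id}:{xml_id}"


# Two different documents that both use the ID "p1" can coexist.
conn.execute("INSERT INTO Part VALUES (?, ?, ?)", ("doc1.xml", "p1", "bolt"))
conn.execute("INSERT INTO Part VALUES (?, ?, ?)", ("doc2.xml", "p1", "nut"))
count = conn.execute(
    "SELECT COUNT(*) FROM Part WHERE xml_id = 'p1'").fetchone()[0]

# Within one document the ID must still be unique.
try:
    conn.execute("INSERT INTO Part VALUES (?, ?, ?)", ("doc1.xml", "p1", "washer"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(count, duplicate_rejected, prefixed_id("doc1.xml", "p1"))
```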

Generating Schema

i. Generating Relational Database Schema from DTDs

Relational schemas are generated by reading through the DTD and processing each element type:

Complex element types generate class tables with primary key columns.

Simple element types are ignored except when processing content models.

To process a content model:

Single references to simple element types generate columns; if the reference is optional (? operator), the column is nullable.

Repeated references to simple element types generate property tables with foreign keys.

References to complex element types generate foreign keys in remote class tables.

PCDATA in mixed content generates a property table with a foreign key.

Optionally generate order columns for all referenced element types and PCDATA.

To process attributes:

Single-valued attributes generate columns; if the attribute is optional, the column is nullable.

Multi-valued attributes generate property tables with foreign keys.

If an attribute has a default, it is used as the column default.

Example:

DTD                                                   Tables
===                                                   ======
<!ELEMENT Order (OrderNum, Date, CustNum, Item*)> ==> Table Order
<!ELEMENT OrderNum (#PCDATA)>                             Column OrderPK
<!ELEMENT Date (#PCDATA)>                                 Column OrderNum
<!ELEMENT CustNum (#PCDATA)>                              Column Date
                                                          Column CustNum

<!ELEMENT Item (ItemNum, Quantity, Part)>         ==> Table Item
<!ELEMENT ItemNum (#PCDATA)>                              Column ItemPK
<!ELEMENT Quantity (#PCDATA)>                             Column ItemNum
                                                          Column Quantity
                                                          Column OrderFK

<!ELEMENT Part (PartNum, Price)>                  ==> Table Part
<!ELEMENT PartNum (#PCDATA)>                              Column PartPK
<!ELEMENT Price (#PCDATA)>                                Column PartNum
                                                          Column Price
                                                          Column ItemFK

In the first step, we generate tables for complex element types and primary keys for these tables.

In the second step, we generate columns for references to simple element types.

In the final step, we generate foreign keys for references to complex element types.
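The three steps can be sketched in Python as follows. This is a simplified illustration, not a DTD parser: the dictionary `dtd` is a hypothetical in-memory form of the Order DTD, attributes and mixed content are left out, and a property table is simply named after the repeated child. Foreign key columns are named after the parent table, so the Part table receives a column called ItemFK.

```python
PCDATA = "#PCDATA"

# Hypothetical in-memory form of the Order DTD: each element maps either
# to PCDATA or to a list of (child element, occurrence operator) pairs.
dtd = {
    "Order": [("OrderNum", ""), ("Date", ""), ("CustNum", ""), ("Item", "*")],
    "OrderNum": PCDATA,
    "Date": PCDATA,
    "CustNum": PCDATA,
    "Item": [("ItemNum", ""), ("Quantity", ""), ("Part", "")],
    "ItemNum": PCDATA,
    "Quantity": PCDATA,
    "Part": [("PartNum", ""), ("Price", "")],
    "PartNum": PCDATA,
    "Price": PCDATA,
}


def generate_tables(dtd):
    """Return {table name: [column, ...]} following the three steps."""
    # Step 1: a class table with a primary key per complex element type.
    tables = {name: [f"{name}PK"]
              for name, model in dtd.items() if model != PCDATA}

    # Step 2: columns (or property tables) for references to simple types.
    for parent, model in dtd.items():
        if model == PCDATA:
            continue
        for child, occurs in model:
            if dtd[child] != PCDATA:
                continue
            if occurs in ("*", "+"):
                tables[child] = [f"{parent}FK", child]  # property table
            else:
                nullable = " NULL" if occurs == "?" else ""
                tables[parent].append(child + nullable)

    # Step 3: foreign keys in the remote (child) class table.
    for parent, model in dtd.items():
        if model == PCDATA:
            continue
        for child, _occurs in model:
            if dtd[child] != PCDATA:
                tables[child].append(f"{parent}FK")
    return tables


tables = generate_tables(dtd)
for name, columns in tables.items():
    print("Table", name, columns)
```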

ii. Generating DTDs from Database Schema

DTDs are generated by starting from a single "root" table or set of root tables and processing each:

Each root table generates an element type with element content in the form of a single sequence.

Each data (non-key) column in the table generates an element type with PCDATA-only content and a reference in the sequence; nullable columns generate optional references.

Primary and foreign keys are generated following these steps:

The remote table is processed in the same manner as a root table.

A reference to the element type for the remote table is added to the sequence.

If the key is the primary key, the reference is optional and repeated (*). This is because there is no guarantee that a row will exist in the foreign key table, nor that only a single row will exist.

If the key is the primary key, PCDATA-only element types are optionally generated for each column in the key. If these are generated, references to these element types are added to the sequence. This is useful only if primary keys contain data.

If the key is a foreign key and is nullable, the reference is optional (?).

Example:

Tables                    DTD
======                    ===
Table Orders              <!ELEMENT Orders (Date, CustNum, OrderNum, Items*)>
    Column OrderNum       <!ELEMENT OrderNum (#PCDATA)>
    Column Date           <!ELEMENT Date (#PCDATA)>
    Column CustNum        <!ELEMENT CustNum (#PCDATA)>

Table Items               <!ELEMENT Items (ItemNum, Quantity, Parts)>
    Column OrderNum
    Column ItemNum        <!ELEMENT ItemNum (#PCDATA)>
    Column Quantity       <!ELEMENT Quantity (#PCDATA)>
    Column PartNum

Table Parts          ==>  <!ELEMENT Parts (PartNum, Price)>
    Column PartNum        <!ELEMENT PartNum (#PCDATA)>
    Column Price          <!ELEMENT Price (#PCDATA)>

In the first step, we generate an element type for the root table (Orders).

Next, we generate PCDATA-only elements for the data columns (Date and CustNum) and add references to these elements to the content model of the Orders element.

Then we generate a PCDATA-only element for the primary key (OrderNum) and add a reference to it to the content model.

Then we add an element for the table (Items) to which the primary key is exported, as well as a reference to it in the content model. We process the data and primary key columns in the remote (Items) table in the same way.

Finally, we process the foreign key table (Parts).
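The walk through the tables can be sketched in Python as shown below. The dictionary `schema` is a hypothetical description of the three tables (it is not read from a real database catalog), and the function name `generate_dtd` is likewise an assumption. The sketch always emits data columns first, then primary key columns, then references to remote tables, so Parts comes out as (Price, PartNum); the order of the references within a generated sequence is a choice of the generator.

```python
# Hypothetical schema description. "data" lists non-key columns as
# (name, nullable); "pk" lists primary key columns to expose as elements;
# "refs" lists (remote table, occurrence operator) reached through a key.
schema = {
    "Orders": {
        "data": [("Date", False), ("CustNum", False)],
        "pk": ["OrderNum"],
        "refs": [("Items", "*")],   # primary key exported to Items
    },
    "Items": {
        "data": [("ItemNum", False), ("Quantity", False)],
        "pk": [],
        "refs": [("Parts", "")],    # non-nullable foreign key to Parts
    },
    "Parts": {
        "data": [("Price", False)],
        "pk": ["PartNum"],
        "refs": [],
    },
}


def generate_dtd(schema, root):
    """Return DTD declarations for root and every table reachable from it."""
    lines, done = [], set()

    def process(table):
        if table in done:           # do not declare a table twice
            return
        done.add(table)
        info = schema[table]
        sequence, leaves = [], []
        for column, nullable in info["data"]:
            sequence.append(column + ("?" if nullable else ""))
            leaves.append(column)
        for column in info["pk"]:
            sequence.append(column)
            leaves.append(column)
        for remote, occurs in info["refs"]:
            sequence.append(remote + occurs)
        lines.append(f"<!ELEMENT {table} ({', '.join(sequence)})>")
        lines.extend(f"<!ELEMENT {leaf} (#PCDATA)>" for leaf in leaves)
        for remote, _occurs in info["refs"]:
            process(remote)         # remote tables are handled like roots

    process(root)
    return lines


dtd_lines = generate_dtd(schema, "Orders")
print("\n".join(dtd_lines))
```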

Related Technologies

Querying XML in ORDBMS

XQuery can be used as a query language for object-relational databases. If an object-relational mapping is used, hierarchies of tables are treated as a single document and the joins are specified in the mapping, so there is no need to specify them explicitly in the query. With XPath, an object-relational mapping must be used to query data across more than one table, because XPath does not support joins across documents.
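The idea that the join belongs to the mapping rather than to the query can be illustrated with a small Python sketch. It rests on several assumptions: the Python standard library has no XQuery engine, so the limited XPath subset of `xml.etree.ElementTree` stands in for the query language; the document is materialized in memory rather than kept virtual; and the table contents and the wrapper element `OrdersDoc` are made up for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (OrderNum INTEGER PRIMARY KEY, CustNum TEXT);
    CREATE TABLE Items (OrderNum INTEGER REFERENCES Orders(OrderNum),
                        ItemNum INTEGER, Quantity INTEGER);
    INSERT INTO Orders VALUES (1, 'C10'), (2, 'C20');
    INSERT INTO Items VALUES (1, 1, 5), (1, 2, 7), (2, 1, 3);
""")


def orders_as_xml():
    """Expose the Orders/Items hierarchy as a single XML tree.

    The join on OrderNum lives here, in the mapping, not in the query.
    """
    root = ET.Element("OrdersDoc")  # hypothetical wrapper element
    orders = conn.execute(
        "SELECT OrderNum, CustNum FROM Orders ORDER BY OrderNum").fetchall()
    for order_num, cust_num in orders:
        order = ET.SubElement(root, "Orders")
        ET.SubElement(order, "OrderNum").text = str(order_num)
        ET.SubElement(order, "CustNum").text = cust_num
        items = conn.execute(
            "SELECT ItemNum, Quantity FROM Items "
            "WHERE OrderNum = ? ORDER BY ItemNum", (order_num,))
        for item_num, quantity in items:
            item = ET.SubElement(order, "Items")
            ET.SubElement(item, "ItemNum").text = str(item_num)
            ET.SubElement(item, "Quantity").text = str(quantity)
    return root


doc = orders_as_xml()
# Path query over the document: quantities of the items in order 1.
# No join appears in the query itself.
quantities = [q.text for q in
              doc.findall("./Orders[OrderNum='1']/Items/Quantity")]
print(quantities)
```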

Java DOM

JDOM is an open source, tree-based, pure Java API for parsing, creating, manipulating, and serializing XML documents. Like DOM, it represents an XML document as a tree composed of elements, attributes, comments, processing instructions, text nodes, CDATA sections, and so on. It is written in and for Java: it consistently follows the Java coding conventions, uses the Java class library, and implements the Cloneable and Serializable interfaces.

A JDOM tree is fully read-write. All parts of the tree can be moved, deleted, and added to, subject to the usual restrictions of XML. Unlike DOM, there are no read-only sections of the tree that cannot be changed.

Java Data Objects

Sun's Java Data Objects (JDO) standard allows Java objects to be persisted in databases. It supports transactions and multiple users. It differs from JDBC in that the developer does not have to think about SQL or database models, and it differs from serialization in that it supports multiple users and transactions. It allows Java developers to use their object model as a data model, so there is no need to spend time moving between the "data" side and the "object" side.
