DTD Schema: A Simple but Powerful XML Schema Language

Mengchi Liu

Email:

School of Computer Science

Carleton University

Ottawa, Ontario, Canada K1S 5B6

Abstract. Purpose: To describe a novel XML schema language called DTD Schema that overcomes the major limitations of DTD and supports the features of XML Schema in a simple and concise way.

Design/methodology/approach: DTD Schema is designed based on DTD and the data definition languages of object-oriented and object-relational databases. It extends DTD with namespaces, richer built-in types and user-defined subtypes, local elements and attributes, and complex types with nonmonotonic multiple element and attribute inheritance with overriding, blocking, conflict handling, and polymorphism.

Findings: XML Schema is recommended by W3C as the schema language for XML. It uses a set of predefined XML tags to define the schema, which is often a long, intricate specification, full of details and concepts, and its verbose syntax often doubles or triples the document length. It is so complicated that even XML experts do not find it human-readable, mostly due to the XML-based syntax.

Research limitations/implications: The only limitation is that DTD Schema is not written in XML. For the same reason, however, it is simple and concise.

Practical implications: DTD Schema is halfway between DTD and XML Schema, and thus it is less complex and much easier for humans to use than XML Schema.

Originality/Value: DTD Schema supports all functionalities of XML Schema as well as the best object-oriented features, including multiple inheritance, overriding, blocking, conflict handling, and polymorphism. Therefore, it is much more expressive than XML Schema.

Keywords: XML, XML schema languages, type hierarchy, non-monotonic inheritance.

Article Type: Research paper

Introduction

XML (Bray et al., 2006), the Extensible Markup Language, is a simplified subset of the Standard Generalized Markup Language (SGML). It was created in the mid-90s for specifying documents that can be exchanged and automatically processed by machines. One of the advantages of XML as a data exchange format is the standardization of validation technology, which benefits both parties who exchange data using XML on the Web. As a result, XML is fast emerging as the dominant standard for data representation and exchange over the Internet.

DTD (Document Type Definition) is the first schema language for XML and is widely used; it is built into XML (Bray et al., 2006). Its technical underpinnings come from the theory of formal languages, and general-purpose parsers that can validate any document against any DTD are well known (Bernstein et al., 2005). It uses a set of rules to define a schema, which makes it very concise and easy to read. However, its support for schema structure is minimal, and its expressive power is rather restricted (Lee and Chu, 2002).

With its rapid development, XML has introduced the possibility of treating Web documents as data sources that can be queried (as with database relations) and that can be related to each other through semantically meaningful links (as with foreign-key constraints). It was at this point that XML began to outgrow its SGML heritage. One of the first enhancements was XML namespaces (Bray et al., 2006). Another important enhancement is the development of the XML Schema specification (Brown et al., 2001), which is designed to rectify many of the limitations of DTD as a data definition language (Bernstein et al., 2005). These limitations include the following:

(1) DTD does not support namespaces.

(2) The syntax of DTD is quite different from that of XML documents, so we cannot use standard XML tools to manipulate DTD schemas.

(3) DTD supports only a few built-in types, such as CDATA and PCDATA; it does not support user-defined types and cannot constrain character data.

(4) DTD provides only limited means for expressing data-consistency constraints. It does not have keys (except for the limited ID type), and the mechanism for specifying referential integrity is very weak.

(5) Element definitions are global to the entire document.

(6) DTD lacks an inheritance mechanism. The basic modeling primitives in DTD are attribute, element, and entity. DTD can easily state the has-a relationship through the hierarchy among elements, but the derivation relationship among types cannot be expressed naturally in DTD.

On the other hand, XML Schema (Biron et al., 2004; Thompson et al., 2004) is the most powerful schema language for XML recommended by W3C. It was developed to overcome the limitations of DTD and has the following main features:

(1) It uses the same syntax as that used for ordinary XML documents.

(2) It is integrated with the namespace mechanism.

(3) It provides full datatype support.

(4) It can derive complex types from base types.

(5) It allows element names to be local, so that different elements can have the same name with distinct types.

(6) It supports key and referential integrity constraints.

(7) It provides a mechanism for specifying documents where the order of element types does not matter.

However, XML Schema is much more complicated than DTD, and even XML experts do not find it human-readable (Møller and Schwartzbach, 2005), mostly due to the XML-based syntax. It uses a set of predefined XML tags to define the schema, which is often a long, intricate specification, and its verbose syntax often doubles or triples the document length compared to DTD (Amorosi et al., 2003).

In fact, DTD is still widely used where the advanced features described above are not required. Although XML Schema is becoming popular, it cannot replace DTD completely. Many users of markup languages have invested heavily in the development of DTDs to help manage their businesses and industries. Many industries have standardized DTDs that manage XML data between applications. If these work well, there is no reason to convert to XML Schema (Campbell et al., 2003).

Recently, some researchers surveyed how XML Schema is actually used in practice. Bex et al. (2005) harvested a large corpus of XML schemas from the Web. They pointed out that the vast majority of XML schemas in practice are structurally equivalent to a DTD; moreover, they found that only 15% of the syntactically correct XSDs used typing in a way that went beyond the power of DTD. This result suggests that the intricate specification of XML Schema keeps its actual modelling power out of users' reach. Bex et al. (2004) provide a study on the use of XML Schema and DTDs, culminating in the observation that the major addition of XML Schema (missed by practitioners) is types, which can easily be supported in extended DTDs.

Besides DTD and XML Schema, many other schema languages have been proposed, including SOX (Davidson et al., 1999), Schematron (Jelliffe, 2000), DSD (Klarlund et al., 2002), Schemapath (Coen et al., 2004), and RELAX NG (Clark, 2002). Some of them extend DTD, such as SOX and RELAX NG, and some use XML syntax, such as DSD and RELAX NG. They are smaller and easier to read than XML Schema. However, they are not as expressive as XML Schema.

Because XML Schema provides the strongest modeling ability in terms of inheritance among XML schema languages, we briefly discuss its inheritance mechanisms in the following. In XML Schema, a schema document may contain type definitions and element and attribute declarations. A new type can be derived by extending or restricting a base type, which may be either complex or simple. A new simple type can be derived using the restriction mechanism, and the set of values represented by the new simple type is a subset of the values of the base simple type. A new complex type can be derived with the extension mechanism by inheriting a complex base type and appending some additional specific element and attribute declarations. Like a simple type, a new complex type can also be derived using the restriction mechanism. Restriction of complex types is conceptually the same as restriction of simple types: a complex type derived by restriction is very similar to its base type, except that its declarations are more limited than the corresponding declarations in the base type. The values represented by the new type are a subset of the values represented by the base type.

In XML Schema, there is a substitution group mechanism, which allows elements to be substitutable for other elements and can be used to simulate polymorphism. Figure 1 declares two new elements, chineseComment and englishComment, and makes them substitutable for the comment element in the instance document. Although the substitution mechanism can be used to simulate polymorphism, it has two shortcomings:

(1) for an element hierarchy, the user has to declare a substitution group for each super-element;

(2) if a new sub-element is added to the element hierarchy, then the declarations of the substitution groups of its super-elements have to be modified.
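As an illustration of the substitution-group declarations discussed above, the following Python sketch embeds a small XSD fragment and lists the elements that name comment as their substitution group. The element names comment, chineseComment, and englishComment come from the text; the xsd:string types, the helper name substitutable_for, and the use of Python's standard xml.etree.ElementTree module are assumptions made for this sketch and are not taken from the original figure.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Assumed declarations: the xsd:string types are illustrative only.
SCHEMA = """\
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="comment" type="xsd:string"/>
  <xsd:element name="chineseComment" type="xsd:string"
               substitutionGroup="comment"/>
  <xsd:element name="englishComment" type="xsd:string"
               substitutionGroup="comment"/>
</xsd:schema>
"""


def substitutable_for(schema_text, head):
    """Return names of global elements whose substitutionGroup is `head`."""
    root = ET.fromstring(schema_text)
    return [
        element.get("name")
        for element in root.findall(f"{{{XSD_NS}}}element")
        if element.get("substitutionGroup") == head
    ]


print(substitutable_for(SCHEMA, "comment"))
```

Only direct members of the group are listed; transitive membership is not followed in this sketch.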

XML Schema provides the redefine mechanism, which can be used to support evolution and versioning of schemas. Unlike the include mechanism, which enables users to use external schema components without any modification, the redefine mechanism allows users to incorporate external schema components with modifications. Because attribute group definitions and model group definitions may be supersets or subsets of their original definitions, the redefine mechanism can be used to simulate overriding and blocking of element inheritance in an element hierarchy, in a two-step way. For example, in the element hierarchy with person and student, the element address is overridden with a simple type in the sub-element student. With XML Schema, this can be simulated in two steps:

(1) A temporary type definition student is derived from the base type person with the extension mechanism and has the same element definitions. The derived definition is stored as a temporary schema document student_tmp.xsd.

(2) The external schema document student_tmp.xsd is redefined with the necessary modifications, and the redefined schema is then stored as student.xsd.

The main shortcoming of the two-step way is that a temporary external schema document must be generated, because type definitions must use themselves as their base type definition in the redefine mechanism.

So far, we have introduced almost all the inheritance facilities in XML Schema. We can draw the following conclusions:

(1) XML Schema does not support the inheritance of attributes.

(2) XML Schema supports only single inheritance, because only one base type is allowed in the extension construct. Since multiple inheritance is not supported in XML Schema, some concept-level semantics cannot be directly represented.

(3) Polymorphism is not directly supported in XML Schema and can only be supported indirectly by using the substitution mechanism.

(4) XML Schema does not support overriding and blocking directly. They can be simulated via the redefine mechanism, but a superfluous external schema document has to be generated.

Nonmonotonic inheritance is a fundamental feature of object-oriented data models (Liu et al., 2002). In object-oriented languages with multiple inheritance, a class may inherit attributes and methods from more than one superclass. One of the problems with multiple inheritance is that an ambiguity may arise when an attribute or method is defined in more than one superclass. Therefore, conflict resolution is important in object-oriented database systems with multiple inheritance, and most systems use the superclass ordering to resolve the conflicts (Liu et al., 2002).
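Conflict resolution by superclass ordering can be seen in a small Python sketch. Python is used here only as a familiar language with multiple inheritance; it is not the system discussed in the paper, and its linearization rule is just one instance of ordering-based resolution. The class names mirror the person, student, teacher, and TA types of the running example, while the contact method and the resolution_order helper are hypothetical names introduced for illustration.

```python
class Person:
    def contact(self):
        return "person contact"


class Student(Person):
    # Overrides the definition inherited from Person.
    def contact(self):
        return "student contact"


class Teacher(Person):
    # Also overrides it, which creates a conflict for any common subclass.
    def contact(self):
        return "teacher contact"


class TA(Student, Teacher):
    """Inherits `contact` from two superclasses; Student is listed first."""


def resolution_order(cls):
    """Return the class names in the order Python searches them."""
    return [klass.__name__ for klass in cls.__mro__]


print(resolution_order(TA))
print(TA().contact())
```

Because Student precedes Teacher in the superclass list of TA, the Student definition wins; swapping the two superclasses would select the Teacher definition instead.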

In this paper, we present a novel XML schema language called DTD Schema that is based on DTD and the data definition language of object-oriented and object-relational databases. It extends DTD with the following features:

(1) namespaces;

(2) richer built-in types and user-defined subtypes, which are declared using the ISA mechanism, in which an existing type is used as the base type and the set of values represented by the derived type is a subset of the values represented by the base type;

(3) not only global elements but also local elements in complex types with nested elements;

(4) complex types and type hierarchy with nonmonotonic inheritance of element and attribute definitions, overriding of elements or attributes inherited from supertypes, blocking of the inheritance of elements or attributes from supertypes, and conflict handling;

(5) constraints on domain values, key, uniqueness, not null, and references;

(6) polymorphism to support polymorphic elements, typing of references, and polymorphic references;

(7) sequence, choice, and all.

DTD Schema supports all functionalities of XML Schema as well as the best object-oriented features, including multiple inheritance, overriding, blocking, conflict handling, and polymorphism. Therefore, it is much more expressive than XML Schema. On the other hand, it is halfway between DTD and XML Schema, and thus it is less complicated and much easier for humans to use than XML Schema. With a simpler form and more expressive power, DTD Schema is a truly feasible schema language for developing XML applications.

There have been other attempts to extend DTD with stronger expressive power. There are two versions of DTD++: 1.0 (Amorosi et al., 2003) and 2.0 (Fiorello et al., 2004). DTD++ provides a solution for specifying and verifying co-constraints on XML documents. Bex et al. (2005) studied the actual expressive power of XML Schema definitions and proposed an equivalent formalism based on contextual patterns rather than on recursive types, which might serve as a lightweight front end for XML Schema. However, our DTD Schema language is the first to extend DTD with most of the features offered by XML Schema and to add new features such as multiple type inheritance, polymorphic types, and type definition overriding in a uniform framework. We also provide concrete examples to illustrate the expressiveness of DTD Schema. The way we extend DTD with the strong expressive power of XML Schema in this paper can also be applied to other schema languages.

Figure 2. A Sample Schema in DTD Schema

The remainder of this paper is organized as follows. We first introduce namespace declarations in DTD Schema, then present the basic types of DTD Schema and user-defined simple types, describe global and local elements, discuss complex type definitions and non-monotonic multiple inheritance with overriding, blocking, and conflict handling, introduce integrity constraints, illustrate polymorphism, including polymorphic elements, polymorphic typing of references, and polymorphic references, and finally conclude the paper.

For convenience of discussion, a sample DTD Schema and a valid XML instance document are given first in Figures 2 and 3, respectively. The discussion throughout the paper is based on them. The example describes a university, in which there are a root element university and 7 child elements: course, underCourse, gradCourse, person, student, teacher, and TA. It has 9 user-defined simple types: postcode, sno, tno, cno, semCode, salary, phone, CAcode, and UScode. It also has the following complex types: addressType, courseType, underCourseType, gradCourseType, personNameType, studentType, teacherType, and TAType. Types underCourseType and gradCourseType are subtypes of courseType; studentType and teacherType are subtypes of personType; TAType is a shared subtype of studentType and teacherType.

Figure 3. An XML Instance Document

Namespace Declarations in DTD Schema

An XML namespace is a collection of names used for element types or attribute names. Using namespaces, name ambiguity can be eliminated. DTD does not support namespaces: a DTD can define any number of elements and attributes, but there is no way to associate them with a namespace. The lack of namespace support makes DTD incapable of supporting reuse by enclosing constructs from external schema documents and unsuitable for the flexible and modular design of complex XML applications.
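How namespaces remove name ambiguity can be sketched with Python's standard xml.etree.ElementTree module, which expands each prefixed name into a namespace-qualified name. The prefixes adm and crs and the root element university appear elsewhere in this paper; the namespace URIs, the title element, and the helper qualified_child_tags are placeholders assumed for this sketch only.

```python
import xml.etree.ElementTree as ET

# The two namespace URIs below are placeholders chosen for this sketch.
DOCUMENT = """\
<university xmlns:adm="http://example.org/adm"
            xmlns:crs="http://example.org/crs">
  <adm:title>Registrar</adm:title>
  <crs:title>Database Systems</crs:title>
</university>
"""


def qualified_child_tags(document_text):
    """Return the namespace-qualified tag of each child of the root."""
    root = ET.fromstring(document_text)
    return [child.tag for child in root]


print(qualified_child_tags(DOCUMENT))
```

Both children share the local name title, yet their qualified names differ, so a processor can tell them apart.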

In DTD Schema, we introduce namespaces as in XML Schema, so that the same name can be used in different contexts with different meanings without causing any conflict. By using different namespaces, this resolves the possible name conflict that arises when we enclose an external construct that has the same element or attribute name as one in the enclosing document. Since DTD Schema is not written in XML, the namespace declaration is designed in a DTD-like syntax.

Example 1. The following are three namespace declarations.

<!NSPACE DEFAULT

<!NSPACE TARGET

<!NSPACE adm

The DEFAULT declaration specifies the default namespace. The adm declaration associates elements and attributes with the adm namespace. The TARGET declaration says that the new elements and attributes defined in the DTD are considered to be part of the target namespace.

In XML Schema, different schemas can be imported from different namespaces and integrated into one schema. Here, we extend DTD to support the assembling of a DTD from multiple DTDs through the include and import mechanisms. Include is used to assemble a DTD from multiple DTDs with the same namespace. Import is used to assemble a DTD from multiple DTDs with different namespaces.

Example 2. The following are two examples of including and importing other DTDs.

<!INCLUDE

<!IMPORT crs

The first declaration is an include and the second is an import. The effect of the first declaration is to include the DTD at the given location in the enclosing document. Note that the included DTD should have the same target namespace. The second declaration imports a DTD with a different target namespace. This namespace is assigned the prefix crs, so that any part of the imported DTD can be referred to with this prefix. Here we assume that the imported and importing DTDs have different target namespaces.

Basic Types in DTD Schema

One of the most evident shortcomings of DTD is that it has a very limited repertoire of basic types, essentially just glorified strings. DTD supports only about 10 kinds of XML-related primitive types, such as #PCDATA and CDATA. Thus, there is no way to specify that the content of an element must be a decimal number or a string with a specific pattern. To solve this problem, XML Schema provides a much richer set of basic types, covering most types used in general-purpose programming languages.

In DTD Schema, we provide the same basic types for attributes and elements, including STRING, INTEGER, FLOAT, DATE, TIME, BOOLEAN, ID, etc. Values of these types are defined in the usual way.

For a schema language, the legal values of a type are as important as the type itself. Constraints on the possible values of a simple data type are called facets. In XML Schema, new simple types can be created by deriving from basic or derived types via the extension or restriction mechanism, and different facet types are supported, including length, pattern, precision, and range.
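The effect of facets can be sketched in Python as a set of checks that narrow the values a base type accepts. This is only an illustration of the idea, not the validation procedure of XML Schema or DTD Schema: the Facets class, the is_valid function, and the concrete restrictions chosen for CAcode and salary are all assumptions made for this sketch, since the actual definitions of those types are not shown in this section.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Facets:
    """Hypothetical container for a few facet kinds named in the text."""
    length: Optional[int] = None
    pattern: Optional[str] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None


def is_valid(value, facets):
    """Check a lexical value (a string) against every facet that is set."""
    if facets.length is not None and len(value) != facets.length:
        return False
    if facets.pattern is not None and not re.fullmatch(facets.pattern, value):
        return False
    if facets.min_value is not None or facets.max_value is not None:
        try:
            number = float(value)
        except ValueError:
            return False
        if facets.min_value is not None and number < facets.min_value:
            return False
        if facets.max_value is not None and number > facets.max_value:
            return False
    return True


# Assumed restrictions; the paper's actual definitions are not shown here.
CA_CODE = Facets(length=7, pattern=r"[A-Z][0-9][A-Z] [0-9][A-Z][0-9]")
SALARY = Facets(min_value=0.0, max_value=500000.0)

print(is_valid("K1S 5B6", CA_CODE))
print(is_valid("12345", CA_CODE))
print(is_valid("42000", SALARY))
print(is_valid("-1", SALARY))
```

Each facet can only reject values, so a type derived this way always accepts a subset of the values of its base type.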