T6L3

XML

Introduction

This lesson is designed to help you gain a greater understanding of the eXtensible Markup Language (XML) and its implications as an emerging technology for use in a variety of Web-based situations. When you finish this lesson you will be able to:

• Define XML.

• Describe the origins of XML.

• List ways XML can be used.

The use of XML is tied to Cascading Style Sheets (CSS). You should read about CSS prior to or in conjunction with this lesson.

This lesson will cover:

  • What is XML?
  • XML Concepts
  • XML Basics

Additional Resources

Webmonkey - XML

<devhead> XML

webreview.com - XML

What is XML?

If you are a Web weenie, you probably have heard about XML. The hype surrounding it has been extensive, and with the release of Internet Explorer 5.0, XML gained support in a commercial browser for the first time. So you are probably wondering what XML is and what all this fanfare is about.

XML stands for eXtensible Markup Language. As the name implies, it is extensible: rather than working with a fixed set of tags, as in HTML, you can define your own. It is derived from SGML (Standard Generalized Markup Language); in fact, it is a subset of SGML designed specifically for the Web. Much like SGML, you use XML to define different markup languages for specific uses. But because XML is much less complex and easier to understand, it lowers the barriers to use that come with SGML.

Why XML?

HTML certainly brought about a revolution by enabling the display of various kinds of information using the World Wide Web. But this technology is starting to show its age, as many advanced Web applications are limited by the characteristics of HTML. XML is the building block of the next generation of Web applications. XML also removes two constraints that have limited Web development. First, it removes the dependence on HTML as the only document type. Second, it removes the complexity that has been associated with SGML development by providing an easy-to-use and understandable development environment.

First, let’s look at why HTML is showing its age. Some of the limitations of HTML are as follows:

• HTML is primarily a presentation technology.

• HTML cannot be extended because it has a fixed set of tags.

• HTML is a flat representation, so it cannot represent information that needs a hierarchy.

• HTML is server bound; it doesn’t allow the client to easily process data. The client is merely a display device.

• HTML can only display one view of the data. For additional views, the client must reconnect to the server, which then sends a new view of the information.

• The design of tags is not consistent: some need both beginning and ending tags (like <b> and </b>), while others (like <p> or <br>) need only a beginning tag. This makes the data difficult to read.

Standards Efforts

Driving the acceptance of XML are standards developed by the World Wide Web Consortium, known as the W3C. XML evolved because the W3C realized that, for the Web to continue to develop, display information needed to be separated from data. XML is streamlined for the purpose of data transfer through the Web, and it is much simpler to use than its predecessor, SGML. You can find more information on XML and the standards efforts at: [[link to this]].

Why Use XML Instead of HTML?

There are a number of reasons why you should use XML instead of HTML. Here are a few:

• XML is not restricted to a limited set of tags.

• With XML you can create your own markup language by defining tags that match your problem.

• Tags and the rules for using them can be designed to represent your exact data.

• The graphical look-and-feel can be separated from data by using style sheets and data can be imported into these forms.

• Tags that reflect the nature of your data allow for faster processing when searching or sorting.

• Complex information can be communicated.

• Markup is much easier to understand.

• Information is much more accessible and reusable.

• Valid XML files are also valid in SGML.

Benefits of using XML

• XML allows the exchange of data between partners.

• With XML you can load and manipulate data on the server or the client.

• In the browser you can eliminate trips to the server for data.

• XML can be used with other compatible technologies like XSL and CSS to format data and provide multiple views of the data.

The Future of XML

XML is a powerful new technology that is positioning itself to become the building block for the next generation of Web applications. Because it is a standard and there are specifications for its use, it will increasingly become important for building smarter applications for the Web. Microsoft added support for XML in its 5.0 version of Internet Explorer.

To gain greater acceptance, however, XML needs to be integrated into the other commercial browsers, such as Netscape Navigator. But don’t despair: Netscape is currently working on its next-generation browser, called Gecko, with the open source group Mozilla.org.

For more information on this effort see:

[[link to this]]

With Gecko, Netscape will begin to support the XML standard. This will be a major step forward for the development of the next generation of application tools for the Web.

Because you are able to separate display information from data when you use XML, the door opens to a variety of new types of applications that can exploit these added capabilities. XML opens the way to database publishing directly to the Web, the wider use of metadata to describe information, new e-commerce applications, and the greater use of science applications on the Web. The following examples illustrate how XML provides these added capabilities:

Database Publishing

Database publishing has been slow to be adopted on the Web. Databases have typically exchanged information using simple file formats, such as one record per line with a semicolon between the fields. This is not sufficient for today’s object-oriented database models, where objects must have internal structure with links between them. XML can represent these structures using elements and attributes. By separating the data from the display information, very powerful applications can be developed that draw data from the database directly into the application. Using XSL (eXtensible Stylesheet Language) and CSS (Cascading Style Sheets), predefined formats can be loaded once and data streamed into the formats.
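As a rough illustration of the difference between a flat record and a structured one, here is a minimal Python sketch using the standard library's xml.etree.ElementTree module. The field order in the flat record (id, name, street, city) and the function name record_to_xml are assumptions made up for this example; they do not come from any particular database.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat export: one record per line, fields separated by semicolons.
FLAT_RECORD = "100;Homer Simpson;123 Main St.;Springfield"


def record_to_xml(line):
    """Turn one flat record into a nested <contact> element."""
    contact_id, name, street, city = line.split(";")
    # The id becomes an attribute that describes the contact element.
    contact = ET.Element("contact", {"id": contact_id})
    ET.SubElement(contact, "name").text = name
    # The address fields get their own sub-element, giving the record structure.
    address = ET.SubElement(contact, "address")
    ET.SubElement(address, "street").text = street
    ET.SubElement(address, "city").text = city
    return contact


contact = record_to_xml(FLAT_RECORD)
xml_text = ET.tostring(contact, encoding="unicode")
print(xml_text)
```

The flat line carries no hint of which fields belong together, while the XML version makes the grouping of street and city under address explicit.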

Electronic Commerce

By using an agreed-upon XML document type, business partners could automatically send information between their systems using XML. While this capability has been possible for years, XML allows it to be easily standardized, extended, and built into the infrastructure of the Internet.

Metadata Applications

Another area where XML is poised to play an important role is in the exchange of metadata. Metadata is simply information about information. A good example is the card catalogue in the library. The individual card describes a resource held in the library. It is not the book or periodical itself but the description of the item. Thus, the card in the card catalogue is the metadata about the resource. In the same way, there is a wide variety of metadata that could be exchanged, and many applications will develop as the capability to exchange this information becomes commonplace.

XML Concepts

This section will provide some basic concepts and terminology related to XML. It will not drill down into the syntax of the language. Let’s start with some terminology.

What is XML Used For?

XML is used to provide a digital representation of a document. This document might be a book, an article, a memo, an email message, or an on-line course. When we talk about digitally representing a document, we are referring to how we put this document in a computer-readable form so that we might store, process, search, transmit, display or print that document. For this to happen we need to tell the computer about the structure of this document. Thus, our goal with XML is to put the document in a code that the computer can understand.

XML documents can include pictures, movies and other multimedia. We do not represent this information as XML, because there is not a simple way to translate these formats into XML. Instead, we leave them in their native formats and refer to them from our XML documents.

Elements

In defining a digital document, there must be a logical structure for the document. Take a book as an example. Most books have chapters, and each chapter will contain titles, sub-titles, paragraphs, figures and pictures.

In XML, these components are called elements. Elements can contain other elements, and each of the elements describes a logical part of our digital document. The words and sentences of the document are called character data. Thus, our book, described as a digital document, can be represented as a hierarchy or tree structure. The name of the specific book that we would be describing is the root element, the chapters would be the branches or sub-elements, and titles, paragraphs, and figures would be the leaves. In XML, elements can have extra information attached to them called attributes. Attributes are used to describe the properties of the element. Think of attributes as footnotes that give greater detail or show the origin of an element.
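To make the tree idea concrete, here is a small Python sketch that parses a made-up book and prints its hierarchy. The tag names (book, chapter, title, paragraph), the attribute names, and the helper function outline are all invented for illustration; XML itself does not prescribe them.

```python
import xml.etree.ElementTree as ET

# A made-up book: the root element holds chapters, which hold the leaves.
BOOK = """<book title="A Sample Book">
  <chapter number="1">
    <title>Getting Started</title>
    <paragraph>First paragraph of character data.</paragraph>
  </chapter>
  <chapter number="2">
    <title>Going Further</title>
    <paragraph>More character data.</paragraph>
  </chapter>
</book>"""


def outline(element, depth=0):
    """Return one indented line per element, walking the tree top down."""
    lines = ["  " * depth + element.tag + " " + str(element.attrib)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(BOOK)
for line in outline(root):
    print(line)
```

Each level of indentation in the printed outline corresponds to one level of nesting in the document: the root, then the branches, then the leaves.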

Entities

Aside from the logical structure of an XML document, it must also have a physical structure. This physical structure is defined by a series of characters. This physical structure allows an XML processor to start at the beginning of the document and read until the end. Thus, XML provides a way for pieces of text to be organized in a non-linear manner and segmented into pieces. A parser can then be used to reorganize the pieces into a linear structure.

The pieces of text are known as entities. These entities have names, which are inserted into your document by using an entity reference. When the processor reads down through your XML document, it sees the entity reference and replaces the reference with the entity itself. Thus, an XML document can be broken up into many files, or entities, and stored in a database, on a hard disk, or generated on the fly by a database. These entities can be located anywhere on the Internet. In our digital document, then, XML elements provide the logical structure, and entities keep track of the location of the chunks of text that make up the physical structure.

Entities are used to break up large files into manageable pieces that can be edited, searched, downloaded and generally made useful by the computer system. Entities allow an author to break up a large document into pieces and make parts of it small enough to access easily over the Internet. Even though these pieces might be stored in a variety of locations, the document still has a logical structure that can be accessed for searching, editing or downloading when necessary.
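The replacement of an entity reference by the entity itself can be seen in a short Python sketch. It assumes the simplest case, an internal entity declared inside the document; the entity name company and the memo tags are made up for the example. Entities stored in separate files work on the same principle, but the standard library parser used here does not fetch them, so they are left out.

```python
import xml.etree.ElementTree as ET

# The entity "company" is declared once and referenced twice as &company;.
DOC = """<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ENTITY company "Springfield Nuclear Plant">
]>
<memo>
  <from>&company;</from>
  <body>Welcome to &company;.</body>
</memo>"""

memo = ET.fromstring(DOC)
# By the time we read the text, the parser has already replaced each reference.
sender = memo.find("from").text
body = memo.find("body").text
print(sender)
print(body)
```

The application never sees the reference itself, only the text it stands for, which is why a document can be assembled from named pieces without the reader noticing.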

Document Type Definitions (DTDs)

Just as there are different types of books (phone book, novel, magazine) which have distinctly recognizable organizations, digital documents can be defined by their elements or organization. In XML, this notion of document type is specified in a Document Type Definition (DTD). The DTD is a set of rules that define the tags that can appear in the document and the way these tags are nested. The DTD describes the element types, attributes, entities and notations used in the document. This definition tells what is legal within the specific type of document you have described. DTDs are critical for standardizing the organizational structure of documents and for the processing of documents by software. Thus, in XML, the DTD provides the formal definition of all element types, attributes and entities that are allowable within a specified type of document.

Valid and Well-Formed Documents

It is important when you create an XML document that it be well-formed and, if it declares a DTD, valid. A valid XML document declares a specific DTD and conforms to the rules set forth in that DTD. DTDs are not necessary for all XML documents, but they provide a higher level of organizational structure to your documents. The existence of a DTD also requires others using your tags to conform to the rules you have defined.

A well-formed document is one that has all the characteristics that make it uniformly useful when distributed over networks. The following characteristics determine whether a document is well-formed:

• The document should start with the declaration <?xml version="1.0"?>.

• There must be a root element that contains all other elements just like the <HTML> element for an HTML document.

• All tags must be properly nested.

• All beginning and ending tags MUST be included and match in case.

• Empty tags use a special XML syntax which indicates they are closed (e.g. <br/>).

• All attribute values are within quotes (e.g. <td height="25" width="50">).
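Several of the rules above can be tried out with an XML parser, which refuses any document that is not well-formed. The sketch below uses Python's standard xml.etree.ElementTree module; the function name is_well_formed and the sample strings are made up for this example. It only tests well-formedness, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Each sample exercises one of the rules in the list above.
samples = {
    "properly nested": "<a><b>text</b></a>",
    "badly nested": "<a><b>text</a></b>",
    "case mismatch": "<name>Homer</Name>",
    "empty tag closed": "<p>line<br/>break</p>",
    "empty tag left open": "<p>line<br>break</p>",
    "quoted attributes": '<td height="25" width="50"></td>',
    "unquoted attribute": "<td height=25></td>",
    "two root elements": "<a></a><b></b>",
}

for label, text in samples.items():
    print(label, "->", is_well_formed(text))
```

Note that markup an HTML browser would accept without complaint, such as an unclosed <br> or an unquoted attribute value, is rejected outright by the XML parser.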

XML Basics

An Example: Comparing HTML and XML Markup

Let’s first look at the difference between HTML and XML markup. XML allows for the creation of tags that represent the information we are trying to transmit. If we are creating a contact list for the Web, we might represent it in HTML as follows:

<html>
  <head>
    <title>My Contact List</title>
  </head>
  <body bgcolor="white">
    <ul>
      <li>Homer Simpson
        <ul>
          <li>Client ID: 100
          <li>Company: Springfield Nuclear Plant
          <li>Email:
          <li>Phone: (555) 555-1234
          <li>Street Address: 123 Main St.
          <li>City: Springfield
          <li>State: Anystate
          <li>Zip: 10000
        </ul>
    </ul>
  </body>
</html>

[[show screen shot of this showing in browser]]

The HTML provides us with no specific information about the data. As humans, we can understand this as contact information about a specific person. But to the computer, it is just information without context. In order to use this information and put it into a database, we would need to strip out all the HTML tags and place the right information into the right fields in the database. In XML, this file would look like this:

<?xml version="1.0"?>
<contact>
  <name>Homer Simpson</name>
  <id>100</id>
  <company>Springfield Nuclear Plant</company>
  <email></email>
  <phone>(555) 555-1234</phone>
  <address>
    <street>123 Main St.</street>
    <city>Springfield</city>
    <state>Anystate</state>
    <zip>10000</zip>
  </address>
</contact>

[[show screen shot of this showing in browser]]

The XML file above is logical, so it makes sense to a computer program that needs to read the file. Using the XML format above, an application could be written to transfer the information into a database. This makes the information much more usable and searchable. Since these new tags are meaningful, they can easily be used by all sorts of applications. XML is about structuring the data in a way that makes sense to the computer.
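As a sketch of what such an application might look like, the Python below reads a copy of the contact record and stores it in a database. The table name contacts, its column layout, and the helper contact_row are assumptions invented for this example, and an in-memory SQLite database stands in for whatever database a real application would use.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A copy of the contact record, kept here so the example is self-contained.
CONTACT_XML = """<contact>
  <name>Homer Simpson</name>
  <id>100</id>
  <company>Springfield Nuclear Plant</company>
  <email></email>
  <phone>(555) 555-1234</phone>
  <address>
    <street>123 Main St.</street>
    <city>Springfield</city>
    <state>Anystate</state>
    <zip>10000</zip>
  </address>
</contact>"""


def contact_row(contact):
    """Pull each field out by tag name; empty or missing elements become ''."""
    paths = ["id", "name", "company", "email", "phone",
             "address/street", "address/city", "address/state", "address/zip"]
    return tuple(contact.findtext(path, default="") or "" for path in paths)


# Hypothetical table layout: one text column per field in the record.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE contacts (id TEXT, name TEXT, company TEXT, "
           "email TEXT, phone TEXT, street TEXT, city TEXT, state TEXT, zip TEXT)")
db.execute("INSERT INTO contacts VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
           contact_row(ET.fromstring(CONTACT_XML)))

row = db.execute("SELECT name, city FROM contacts WHERE id = '100'").fetchone()
print(row)
```

Because every value is labelled by its tag, the program can ask for each field by name. With the HTML version it would have to guess which list item held which piece of data.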

An Example: DTD

The above XML file is a well-formed XML file because it follows all the rules set forth in the XML standard. It is perfectly acceptable to create an XML document without a DTD. The hierarchy of tags can be inferred from the structure of the document. However, nothing stops someone from making a “grammatical” mistake when using the new markup language. Without a set of rules, anyone may omit parts of the structure without consequences. If we create a DTD, it provides the rules for the document and requires that all tags are included and nested properly.

Here is the same document with a proper DTD included.

<?xml version="1.0"?>
<!DOCTYPE list [
  <!ELEMENT list (contact*)>
  <!ELEMENT contact (name, id, company, email, phone, address)>
  <!ELEMENT address (street, city, state, zip)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT id (#PCDATA)>
  <!ELEMENT company (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT street (#PCDATA)>
  <!ELEMENT city (#PCDATA)>
  <!ELEMENT state (#PCDATA)>
  <!ELEMENT zip (#PCDATA)>
]>
<list>
  <contact>
    <name>Homer Simpson</name>
    <id>100</id>
    <company>Springfield Nuclear Plant</company>
    <email></email>
    <phone>(555) 555-1234</phone>
    <address>
      <street>123 Main St.</street>
      <city>Springfield</city>
      <state>Anystate</state>
      <zip>10000</zip>
    </address>
  </contact>
</list>

Making Sense of the DTD

<!DOCTYPE list [

This line indicates that list is the root element, which contains all other elements.

<!ELEMENT contact (name, id, company, email, phone, address)>

This line defines the contact tag and the other tags that must appear inside the contact tag.

<!ELEMENT address (street, city, state, zip)>

This line defines the address tag and the other tags that must appear inside the address tag.

(#PCDATA)>

This indicates parsed character data. Basically, the element will contain a string of text.

*

The asterisk indicates that an item may occur zero or more times. An example:

<!ELEMENT contact (name, id, company, email, phone, address, comments*)>

This would indicate that comments tags are optional inside a contact, or that there may be multiple comments tags.

+