CIS 679 852

PROJECT

XML

Joanne M. Smith

May 1, 2001

INDEX

  • Introduction
  • A Brief History of Markup Languages
  • What is XML?
  • What Does XML Mean to Business in General?
  • Down To The Nitty Gritty
  • What Does XML Mean to the Future of Business?
  • Management Implications
  • Summary
  • A Word About Standards
  • Thoughts on the Defense Industry
  • References

INTRODUCTION

Originally, the Internet was a means of publishing scientific documents. Today, it has developed into a medium equal to print and television. It is also an interactive medium, used today to support such activities as online shopping, electronic banking and online trading.

The essential business purpose of the Internet is communication. Before the computer, even before the printing press, there existed international corporations with impressive communication networks. While the technology of these networks has changed dramatically over the years and information now travels far faster, the goal has remained the same: “Get important information to concerned parties as quickly as possible to allow them to collaborate at a distance.”

Business has been successfully transacted over the years through the exchange of standardized documents, whether electronic or on papyrus. Documents work for business because the interacting parties do not need to know one another’s internal workings or procedures. The “document” tells the parties involved what is required to transact the business at hand and no more. In the course of its business, a corporation may deal with many different types of documents: purchase orders, memos, legal documents, invoices, receipts and so on. Documents, therefore, are the communication medium of the corporation, and the Internet is a vehicle by which they can be exchanged.

The most popular way to define and create documents on the Web today is with the electronic publishing language HTML (Hypertext Markup Language). HTML, in short, is concerned mainly with appearances; it arranges text and images on a page. HTML produces Web sites that function much like a fax machine, merely sending documents to those who ask. As anyone who has used the Internet knows, whether doing research or placing an online order, documents are readily available, but finding the ones needed is difficult and frustrating, and exchanging documents back and forth can be slow and time-consuming. Individuals and businesses are requiring more from their Web sites. Developers must write applications that can run on any platform and allow everyone to view data in a similar way, no matter what system or operating environment they are using. Web sites require security mechanisms that protect valuable information even as data is made available to clients and vendors. Documents must be treated as information, not just as a picture of a document. HTML is acceptable for displaying information to humans but is poorly suited to information that must be acted upon by computers.

The way to address this limitation of HTML appears easy: find a way to label or “tag” what information is, not merely what it looks like. An example would be to label the parts of an online order for a sweater not as font, paragraph, row and column, which is what HTML does, but as price, size, quantity and color. A program could then recognize the document for what it is, a customer order, and process it accordingly through shipping, accounting and so on. XML, or Extensible Markup Language, is a new language designed to make information self-describing; it allows authors to describe the data in a document separately from the formatting of that document and thus makes the document easier for computers to process.
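The sweater order described above can be sketched in a few lines of Python. This is only an illustration: the element names (order, item, color, size, quantity, price) and the function order_total are hypothetical and do not come from any standard order vocabulary. Because each piece of data is tagged for what it is, the program can pick out the quantity and the price without knowing anything about fonts or table layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document; the tag names are illustrative only.
ORDER_XML = """
<order>
  <item>sweater</item>
  <color>navy</color>
  <size>M</size>
  <quantity>2</quantity>
  <price currency="USD">39.50</price>
</order>
""".strip()


def order_total(xml_text):
    """Return quantity times unit price for a single-item order."""
    order = ET.fromstring(xml_text)
    quantity = int(order.findtext("quantity"))
    unit_price = float(order.findtext("price"))
    return quantity * unit_price


if __name__ == "__main__":
    print(f"Order total: {order_total(ORDER_XML):.2f}")
```

The same document could be handed unchanged to a shipping program that reads only the item, color and size elements.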

The above is an overly simplified explanation. XML has spread rapidly through all fields of science and into industries that run the gamut from manufacturing to medicine. It is expected to revolutionize network-oriented applications, especially in the area of data interchange, that is, how businesses communicate. The purpose of this paper is to introduce the concept of XML: to define what XML is and where it has come from, to explore some of the areas where XML may prove useful in both the near and long term, and to look at the implications these applications will have for business in general and, with respect to the introduction and use of this technology, for management in particular.

A BRIEF HISTORY OF MARKUP LANGUAGES

XML is a markup language, and one way to grasp the opportunities it may afford a business in the future is to understand what it is, how it evolved and what problems it was designed to solve.

Markup originated in the publishing industry. In traditional publishing, the manuscript is annotated with layout instructions for the typesetter. The handwritten annotations are called markup. Markup is a standalone activity that takes place after writing and before typesetting.

In the world of computers, word processing requires users to specify the appearance of their text. An example is a user selecting a particular font for the text and then its position on the page. This information on font and position is also called markup and is stored as special codes with the text. This activity parallels traditional markup with one important difference: the markup information is stored electronically.

Early markup languages were usually invented by the companies that sold document processing software. As an alternative to these proprietary systems, SGML (Standard Generalized Markup Language) was defined in 1986 as an international standard for document markup. SGML was founded on a generic coding concept, with the aim of devising a flexible, precise and descriptive vocabulary for expressing the contents of electronic documents. This meant that developers were no longer tied to a particular vendor’s markup language and could develop their own by using SGML. By specifying certain details about their documents, such as the names of the components or the structure, developers could easily convert the documents into other formats. This in turn made documents more versatile, because they could be read across several different applications. A document processing application that knows a document’s structure can do more things with it more efficiently. SGML grew in popularity because it gave documents a previously unheard-of level of portability. SGML became widely used by manufacturers, insurance companies and computer companies for their documentation needs. The language’s downside, however, was its complexity. This complexity proved a serious limitation for its adoption in applications to be used by a large number of non-expert users over the Web or on corporate intranets. In answer to the issue of complexity came HTML, Hypertext Markup Language.

HTML is an SGML application: it defines a document type by using SGML syntax to indicate the purpose of each part of a document. HTML grew out of work begun in 1989 as a specific markup language to identify the structure of Web documents. Specifically, it was designed to enable the transmission and display of hypertext documents, that is, documents containing links to other documents, across a network.

The W3C, World Wide Web Consortium, was founded in 1994 and went on to publish HTML specifications as W3C Recommendations, beginning with HTML 3.2 in 1997. The W3C is a collection of companies and universities around the world interested in developing and promoting common protocols for the Web’s evolution.

The invention and standardization of HTML made possible the rapid growth of the Web. HTML gave developers a simple syntax with a fixed tag set (a tag is a word or phrase used for identification purposes, for example to mark where a paragraph starts and ends), made it easy to create multimedia documents by incorporating images and audio, and enabled many documents to be linked together. However, as the Web grew, HTML grew. It has become a very complex language as more and more “tags” were added to address the growing needs of Web users. As e-commerce continues to grow, even more tags are needed. The possible combinations of tags are almost endless, and the result of a particular combination may differ from one browser to another, which increases ambiguity.

In addition, as more companies use intranets within their organizations, the Web now serves as the interface for a variety of information systems. These systems have a much richer internal structure than can be represented in HTML. This limitation is preventing the Web from being used as a platform for information exchange on a large scale.

Couple the above limitations with the projection made by the W3C that by the year 2002, 75% of Web surfers will not be using their PCs for this activity but will instead access the Web from Palm Pilots or smart phones, and it was clear that something needed to be done. Devices such as Palm Pilots and smart phones are not yet as powerful as a PC and cannot process a complex language like HTML. What was required was a way of providing the richness of SGML with the ease of use of HTML for publishing and accessing documents online. Quite simply, SGML was too much for what was required for the Web and HTML was not enough. XML was developed to address these issues.

WHAT IS XML?

In the summer of 1996, Jon Bosak of Sun Microsystems recruited a group of SGML experts and formed the XML Working Group, working under the auspices of the W3C. This group began to work on a version of SGML that would prove simpler to implement, especially with respect to delivering documents over the Web.

The design goals for XML, as set out in Clause 1.1 of the W3C Recommendation, point to this need for XML to bridge the gap between SGML and HTML. The design goals are as follows:

  • XML shall be straightforwardly usable over the Internet.
  • XML shall support a wide variety of applications.
  • XML shall be compatible with SGML.
  • It shall be easy to write programs which process XML documents.
  • The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  • XML documents should be human-legible and reasonably clear.
  • The XML design should be prepared quickly.
  • The design of XML shall be formal and concise.
  • XML documents shall be easy to create.
  • Terseness in XML markup is of minimal importance.

XML was created under the guidelines these goals produced. By capturing these goals within its design, XML offers benefits that can be grouped into four major categories:

  1. XML is extensible.
  2. XML has precise and deep structures.
  3. XML has two general document types.
  4. XML has powerful extensions.

The value to the user community of these benefits is considerable. Because XML is extensible, authors are able to create their own elements. This means that documents can be customized according to the kind of information that requires processing, which gives the designer the power to fit documents to the needs of the business. HTML cannot do this; every designer is restricted to the same fixed set of elements.

No ambiguity exists in XML markup, which means that programmers have a clear structure to work with, and this makes applications easier to write and maintain. When clear structures are combined with extensibility, documents become flexible and reusable. Reusability is important because the same document may provide different information to different users depending on their information needs. Deep structures allow for a method of content management, which in turn lets computers process information effectively.

In SGML a document is always defined by a document type definition (DTD). A DTD defines specifically what kinds of elements, attributes and entities may be in that document. In long documents, the DTD becomes very complex. When a document is processed, the DTD is accessed so that the document may be checked for validity. A document that adheres to all the rules of the DTD as well as the rules of the language specification is a valid document. HTML allows for many ways to “work around” the rules. The Web is full of HTML documents that look fine to the human eye but are sloppy and break many rules, which makes processing the information difficult.

XML denies a programmer the ability to break the rules of the specification while allowing for another type of document, the well-formed document. The well-formed document does not have to adhere to a DTD, but it must adhere to basic rules about structure, chief among them that every element has matching open and close tags and that the document has exactly one root element. This forces the issue of precise structure and simplifies the validation and processing of documents.
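A minimal sketch of a well-formedness check follows, assuming Python and its standard xml.etree.ElementTree parser. The function name is_well_formed and the sample memo documents are hypothetical. The parser needs no DTD; it simply refuses any text that breaks the structural rules described above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents.
GOOD = "<memo><to>Sales</to><body>Ship Friday.</body></memo>"
UNCLOSED = "<memo><to>Sales<body>Ship Friday.</body></memo>"  # <to> is never closed
TWO_ROOTS = "<memo>one</memo><memo>two</memo>"                # more than one root

if __name__ == "__main__":
    for name, text in [("good", GOOD), ("unclosed", UNCLOSED), ("two roots", TWO_ROOTS)]:
        print(name, is_well_formed(text))
```

Note that this checks well-formedness only. Checking validity against a DTD requires a validating parser, which this sketch does not attempt.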

While HTML’s simple linking mechanism has made Web development easy, it lacks the power to provide other linking capabilities such as multiple links. HTML supports only single-point links. XML’s linking technology allows for bi-directional and multi-way links, as well as links to a span of text either within the same document or in other documents.

In summary, in XML, the identification of what a piece of information is, is separated from information on how that information should be presented or processed. So what does that all boil down to? “HTML created a way for every computer user to read Internet documents. XML makes it possible, despite the Babel of incompatible computer systems, to create an Esperanto that all can read and write. Unlike most computer data formats, XML also makes sense to humans because it consists of nothing more than ordinary text.” (Bosak and Bray, 1999.)

WHAT DOES XML MEAN TO BUSINESS IN GENERAL?

What does XML provide to the business community? How can the above benefits be translated into business goals and objectives, and into their attainment?

First, XML will enable a longer life span and a further reach for information. XML data is plain text. Compared with word-processor file formats that change every two years, or data that exists in proprietary databases, XML provides the freedom to use and reuse data without being tied to specific hardware or software. The XML language consists of rules anyone can follow to create a markup language that leverages existing infrastructure. The rules ensure that a single compact program can process all new languages. What this means is that it is possible to create a language that everyone can read and write. Even languages that use a different character set, like Japanese or Russian, can be read by software that processes XML. Information is exchanged not only between different computers but across national and cultural boundaries as well.
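The point about character sets can be illustrated with a short sketch, again assuming Python's standard xml.etree.ElementTree parser. The greetings document and the function greetings_by_language are hypothetical. The same few lines of code read Japanese, Russian and English text without any special handling, because XML text is defined in terms of Unicode characters.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing several writing systems.
GREETINGS_XML = """<greetings>
  <greeting lang="ja">こんにちは</greeting>
  <greeting lang="ru">Здравствуйте</greeting>
  <greeting lang="en">Hello</greeting>
</greetings>"""


def greetings_by_language(xml_data):
    """Map each lang attribute to the text of its greeting element."""
    root = ET.fromstring(xml_data)
    return {g.get("lang"): g.text for g in root.findall("greeting")}


if __name__ == "__main__":
    # The document may arrive as text or as UTF-8 bytes from another machine.
    as_bytes = GREETINGS_XML.encode("utf-8")
    for lang, text in greetings_by_language(as_bytes).items():
        print(lang, text)
```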

Second, XML provides for a richer, more intelligent and easier to use Web. XML allows a business, or more importantly an entire industry or academic group, to define a markup that describes its data. This will lead to improved search engines that can match on tags as well as text content, and it can support intelligent manipulation of data by a client, since the markup is not tied to the appearance of the formatted data as in HTML.
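The difference between matching on tags and matching on text alone can be sketched as follows. This is an illustration under assumptions: the catalog document, its element names, and the two search functions are hypothetical. A plain text search for “green” also finds a pennant that merely has the word in its name, while a search on the color tag finds only the product that is actually green.

```python
import xml.etree.ElementTree as ET

# Hypothetical product catalog; the tag names are illustrative only.
CATALOG_XML = """<catalog>
  <product><name>Green sweater</name><color>green</color></product>
  <product><name>Blue scarf</name><color>blue</color></product>
  <product><name>Bowling Green pennant</name><color>orange</color></product>
</catalog>"""


def names_by_text(xml_text, word):
    """Text-only search: match the word anywhere in a product's text."""
    catalog = ET.fromstring(xml_text)
    return [p.findtext("name") for p in catalog.iter("product")
            if word.lower() in " ".join(p.itertext()).lower()]


def names_by_color(xml_text, color):
    """Tag-aware search: match only the content of the color element."""
    catalog = ET.fromstring(xml_text)
    return [p.findtext("name") for p in catalog.iter("product")
            if p.findtext("color") == color]


if __name__ == "__main__":
    print(names_by_text(CATALOG_XML, "green"))
    print(names_by_color(CATALOG_XML, "green"))
```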

Today, computing devices, whether PCs or pocket planners, connected to the Web don’t do much more than get a form, fill it out and send it back via a Web server. As XML is utilized by more and more Web sites, these devices will be able to do more of the processing on the spot. This will reduce the load on Web servers, reduce network traffic dramatically and will speed and ease the search for information.

Lastly, XML will improve computer-to-computer communication. XML’s text-based data, its self-describing markup, the fact that its data can be validated, and the ready availability of processors that can be plugged into other programs make XML an impressive tool for machine-to-machine communication.

Applying these enabling properties to actual business applications, we can see XML being used in two categories of applications: document applications and data applications. XML can be applied in these application areas immediately. The difference between the two is merely qualitative. In both categories, the same XML standard is deployed and the same tools are used; only the goals differ. This is an important point, however, because it means a business can apply or reuse the same tools across a number of different applications.

The use of XML for document publishing offers a distinct advantage. XML concentrates on the structure of the document and makes it independent of the delivery medium. It is therefore possible to edit and maintain documents in XML and automatically publish them on different media. The key word here is “automatically.”

The ability to target multiple media is highly important because many publications are available both online and in print. The Web is a rapidly changing environment, where what is “in” this year will be “out” the next, and this requires sites to be reformatted regularly. In addition, some Web sites are optimized for specific viewers. This often leads to the development of two or more versions of the same site, one generic and one aimed at specific users, which is costly. All of these points make it clear that the optimal approach is to maintain documentation in a single, media-independent format that can be automatically converted into publishing formats. The more media that must be supported and the larger the document, the more important it is that publishing be automatic.
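Automatic publishing from a single source can be sketched as follows. This is a minimal illustration, assuming Python's standard library. The article document, its element names (article, title, para) and the two conversion functions are hypothetical. In practice a stylesheet language would normally do this work, but the principle is the same: the XML source is edited once, and each output format is generated from it.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical media-independent source document.
ARTICLE_XML = """<article>
  <title>Quarterly Update</title>
  <para>Sales rose in the first quarter.</para>
  <para>Costs held steady.</para>
</article>"""


def to_html(xml_text):
    """Render the article as an HTML fragment for the Web."""
    article = ET.fromstring(xml_text)
    parts = [f"<h1>{escape(article.findtext('title'))}</h1>"]
    parts += [f"<p>{escape(p.text)}</p>" for p in article.findall("para")]
    return "\n".join(parts)


def to_plain_text(xml_text):
    """Render the same article as plain text, e.g. for e-mail or print."""
    article = ET.fromstring(xml_text)
    title = article.findtext("title")
    lines = [title.upper(), "=" * len(title), ""]
    lines += [p.text for p in article.findall("para")]
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_html(ARTICLE_XML))
    print()
    print(to_plain_text(ARTICLE_XML))
```

Adding a third medium means writing one more conversion function; the source document itself does not change.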