DOM Tutorial

In this tutorial, we will cover the following topics:

1. DOM Characteristics

2. DOM node tree and node types

3. DOM Programming

1. DOM Characteristics

Access XML document as a tree structure
Composed of mostly element nodes and text nodes
Can “walk” the tree back and forth
Larger memory requirements
Fairly heavyweight to load and store
Use it when walking and modifying the tree

2. DOM Tree and Nodes

XML document is represented as a tree
A tree is made of nodes
There are 12 different node types
Document node
Document Fragment node
Element node
Attribute node
Text node
Comment node
Processing instruction node
Document type node
Entity node
Entity reference node
CDATA section node
Notation node
Nodes may contain other nodes (depending on node types)

Parent nodes contain child nodes

Example 1: An XML-RPC request document

<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="xml-rpc.css"?>
<!-- It's unusual to have an xml-stylesheet processing
     instruction in an XML-RPC document but it is legal, unlike
     SOAP where processing instructions are forbidden. -->
<!DOCTYPE methodCall SYSTEM "xml-rpc.dtd">
<methodCall>
  <methodName>getQuote</methodName>
  <params>
    <param>
      <value><string>PTT</string></value>
    </param>
  </params>
</methodCall>

The document node representing the root of this document has four child nodes, in this order:

A processing instruction node for the xml-stylesheet processing instruction
A comment node for the comment
A document type node for the document type declaration
An element node for the root methodCall element

The XML declaration and the white space between these nodes are not part of the DOM model, so the parser does not include them in the tree it builds.
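One way to check this is to parse a small document and list the children of the document node. The sketch below is only an illustration, not the tutorial's own code: the class name DocumentChildren, the shortened sample document, and its internal DTD subset (used so that no external xml-rpc.dtd file has to exist) are all assumptions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DocumentChildren {

    // A shortened stand-in for Example 1. The internal DTD subset is an
    // assumption that avoids loading an external xml-rpc.dtd file.
    static final String SAMPLE =
        "<?xml version=\"1.0\"?>\n"
      + "<?xml-stylesheet type=\"text/css\" href=\"xml-rpc.css\"?>\n"
      + "<!-- a comment in the prolog -->\n"
      + "<!DOCTYPE methodCall [ <!ELEMENT methodCall ANY> ]>\n"
      + "<methodCall/>\n";

    // Returns the node type codes of the document node's children, in order.
    static List<Short> childTypes(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        List<Short> types = new ArrayList<>();
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            types.add(n.getNodeType());
        }
        return types;
    }

    public static void main(String[] args) throws Exception {
        // Compare the printed codes with the constants declared in org.w3c.dom.Node
        System.out.println(childTypes(SAMPLE));
    }
}
```

The XML declaration and the line breaks between the prolog items do not show up as children, so the list has one entry each for the processing instruction, the comment, the document type, and the root element.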

Each element node has a name, a local name, a namespace URI (which may be null if the element is not in any namespace) and a prefix (which may also be null). It also contains children. For example, consider this value element:

<value><string>PTT</string></value>

When represented in DOM, it becomes a single element node with the name value. This node has a single element node child for the string element. The string element has a single text node child containing the text PTT.

Example 2: Elements with Namespaces

Consider this db:para element:

<db:para xmlns:db="…"
         xmlns="…">
  or consider this <markup>para</markup> element:
</db:para>

In DOM it’s represented as an element node with the name db:para, the local name para, the prefix db, and the namespace URI declared by the xmlns:db attribute. It has three children:

A text node containing the text or consider this
An element node with the name markup, the local name markup, the namespace URI declared by the default xmlns attribute, and a null prefix
Another text node containing the text element:
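These properties can be read back with the standard Node methods getNodeName(), getLocalName(), getPrefix(), and getNamespaceURI(). The following is a minimal sketch under stated assumptions: the class name NamespaceProperties is made up, and both namespace URIs are hypothetical placeholders, since the values used in Example 2 are not given here. Note that JAXP factories are not namespace aware unless setNamespaceAware(true) is called.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NamespaceProperties {

    // Both namespace URIs below are hypothetical placeholders.
    static final String SAMPLE =
        "<db:para xmlns:db=\"http://example.com/db\""
      + " xmlns=\"http://example.com/default\">"
      + "or consider this <markup>para</markup> element:"
      + "</db:para>";

    // Parses the string with a namespace-aware parser and returns the root element.
    static Element parseRoot(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // off by default in JAXP
        Document doc = factory.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement();
    }

    // Formats name, local name, prefix, and namespace URI on one line.
    static String describe(Node node) {
        return node.getNodeName() + " | " + node.getLocalName() + " | "
             + node.getPrefix() + " | " + node.getNamespaceURI();
    }

    public static void main(String[] args) throws Exception {
        Element para = parseRoot(SAMPLE);
        System.out.println(describe(para));
        // The second child (index 1) is the markup element
        System.out.println(describe(para.getChildNodes().item(1)));
    }
}
```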

White space is included in text nodes, even if it’s ignorable. For example, consider the methodCall element in Example 3.

Example 3: Text Nodes with White Space

<methodCall>
  <methodName>getQuote</methodName>
  <params>
    <param>
      <value><string>PTT</string></value>
    </param>
  </params>
</methodCall>

It is represented as an element node with the name methodCall and five child nodes:

A text node containing only white space
An element node with the name methodName
A text node containing only white space
An element node with the name params
A text node containing only white space

Of course, these element nodes also have their own child nodes.
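The five children can be observed directly by listing the node names of the root element's children. This is a small sketch rather than part of the original examples; the class name WhitespaceChildren and the helper childNames() are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class WhitespaceChildren {

    // The methodCall element of Example 3, with its line breaks and indentation
    static final String SAMPLE =
        "<methodCall>\n"
      + "  <methodName>getQuote</methodName>\n"
      + "  <params>\n"
      + "    <param>\n"
      + "      <value><string>PTT</string></value>\n"
      + "    </param>\n"
      + "  </params>\n"
      + "</methodCall>";

    // Returns the node names of the root element's direct children, in order.
    static List<String> childNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        List<String> names = new ArrayList<>();
        Node child = doc.getDocumentElement().getFirstChild();
        while (child != null) {
            names.add(child.getNodeName()); // text nodes are named "#text"
            child = child.getNextSibling();
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childNames(SAMPLE));
    }
}
```

Because there is no DTD, the parser cannot know that the white space is ignorable, so every line break between elements becomes its own text node.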

As well as element and text nodes, an element node can also contain comment and processing instruction nodes. Depending on how the parser behaves, an element node might also contain some CDATA section nodes and/or entity reference nodes. However, many parsers resolve these automatically into their component text and element nodes and do not report them separately.
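With JAXP, one of the switches that controls this behavior is DocumentBuilderFactory.setCoalescing(), which asks the parser to convert CDATA sections to text and merge them with adjacent text nodes. The sketch below counts the children of a small element with the switch off and on. The class name CdataHandling and the sample string are assumptions, and the exact node counts depend on the parser in use.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataHandling {

    // Text, then a CDATA section, then more text (hypothetical sample)
    static final String SAMPLE = "<a>x<![CDATA[y]]>z</a>";

    // Counts the root element's children with coalescing switched on or off.
    static int childCount(String xml, boolean coalescing) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setCoalescing(coalescing);
        Document doc = factory.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("separate CDATA node: " + childCount(SAMPLE, false));
        System.out.println("coalesced:           " + childCount(SAMPLE, true));
    }
}
```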

An attribute node has a name, a local name, a prefix, a namespace URI, and a string value. The value is normalized as required by the XML 1.0 specification; for example, tabs and line breaks inside the value are replaced with spaces.

Attributes are not considered to be children of the element they’re attached to. Instead they are part of a separate set of nodes. For example, consider this Quantity element:

Example 4: An Element with an Attribute

<Quantity amount="17"/>

This element has no children, but it does have a single attribute with the name amount and the value 17.
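The separation between children and attributes shows up in the API: attributes are reached through getAttributes(), which returns a NamedNodeMap, and never through the child-navigation methods. The following sketch illustrates this for the Quantity element; the class name AttributeAccess is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AttributeAccess {

    static final String SAMPLE = "<Quantity amount=\"17\"/>";

    // Parses the string and returns the root element.
    static Element parseRoot(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element quantity = parseRoot(SAMPLE);
        NamedNodeMap attributes = quantity.getAttributes();
        Node amount = attributes.getNamedItem("amount");

        System.out.println("has children:    " + quantity.hasChildNodes());
        System.out.println("attribute count: " + attributes.getLength());
        System.out.println("amount value:    " + amount.getNodeValue());
        // An attribute node is not a child, so it reports no parent
        System.out.println("amount parent:   " + amount.getParentNode());
    }
}
```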

3. DOM Programming

DOM Programming Procedures

Create a parser object
Set features and read properties
Parse XML documents and get Document object
Perform operations:
  Checking well-formedness
  Traversing DOM
  Manipulating DOM
  Creating a new DOM
  Writing out DOM

3.1 Checking well-formedness

The basic approach is as follows:

Use the static DocumentBuilderFactory.newInstance() factory method to return a DocumentBuilderFactory object.
Use the newDocumentBuilder() method of this DocumentBuilderFactory object to return a parser-specific instance of the abstract DocumentBuilder class.
Use one of the five parse() methods of DocumentBuilder to read the XML document and return an org.w3c.dom.Document object.

Creating the Basic Program

Start with the basic skeleton of a command-line application, and check that an argument has been supplied on the command line:

public class JAXPChecker {

    public static void main(String[] argv) {
        if (argv.length != 1) {
            System.err.println("Usage: java JAXPChecker <XML filename>");
            System.exit(1);
        }
        String documentName = argv[0];
    } // main

} // JAXPChecker

1) Import the Required Classes

Add these lines to import the JAXP APIs you’ll be using:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.FactoryConfigurationError;
import javax.xml.parsers.ParserConfigurationException;

Add these lines for the exceptions that can be thrown when the XML document is parsed:

import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

Add these lines to read the sample XML file and identify errors:

import java.io.File;
import java.io.IOException;

Finally, import the W3C definition for a DOM and DOM exceptions:

import org.w3c.dom.Document;
import org.w3c.dom.DOMException;

2) Declare the DOM

The org.w3c.dom.Document interface is the W3C name for a document in a DOM. Whether you parse an XML document or create one, a Document instance will result. We’ll want to reference that object from another method later on, so declare it as a static field of the class:

public class JAXPChecker {

    static Document document;

    public static void main(String[] argv) {

3) Instantiate the Factory

Next, add the last statement shown below to obtain an instance of a factory that can give us a document builder:

    public static void main(String[] argv) {
        if (argv.length != 1) {
            …
        }
        String documentName = argv[0];
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    }

4) Get a Parser and Parse the File

Now, add the try/catch block below, after the factory has been created, to get an instance of a builder and use it to parse the specified file:

        try {
            DocumentBuilder builder = factory.newDocumentBuilder();
            document = builder.parse(new File(documentName));
            System.out.println(documentName + " is well-formed");
        } catch (SAXException sxe) {
            System.out.println(documentName + " is not well-formed");
            // Prefer the wrapped exception, if there is one, for the stack trace
            Exception x = sxe;
            if (sxe.getException() != null) {
                x = sxe.getException();
            }
            x.printStackTrace();
        } catch (ParserConfigurationException pce) {
            System.out.println("Parser configuration error when parsing " + documentName);
            pce.printStackTrace();
        } catch (IOException ioe) {
            System.out.println("Due to an IOException, the parser could not check " + documentName);
            ioe.printStackTrace();
        }
    } // main

Note that a JAXP-conformant document builder is required to report SAX exceptions when it has trouble parsing the XML document. The DOM parser does not have to actually use a SAX parser internally, but since the SAX standard was already there, it seemed to make sense to use it for reporting errors. The code inside catch (SAXException sxe) {…} checks whether the SAXException wraps another exception. If it does, the code prints the stack trace of the wrapped exception. If the exception contains only a message, the code prints the stack trace starting from the location where the exception was generated.
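The imports above include SAXParseException, the SAXException subclass that carries the position of the error. As a hedged illustration of how it could be used, the sketch below parses XML held in a string and reports the line and column of the first fatal error. The class name ParseErrorLocation and the method check() are assumptions, not part of JAXPChecker.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ParseErrorLocation {

    // Returns null when xml is well-formed, otherwise a description of the problem.
    static String check(String xml)
            throws ParserConfigurationException, IOException {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to System.err
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException spe) {
            // The subclass must be caught before SAXException
            return "line " + spe.getLineNumber()
                 + ", column " + spe.getColumnNumber()
                 + ": " + spe.getMessage();
        } catch (SAXException sxe) {
            return sxe.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(check("<a><b/></a>")); // well-formed
        System.out.println(check("<a><b></a>"));  // mismatched end tag
    }
}
```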

Using Xerces to check well-formedness

Example 5: XercesChecker.java

import org.apache.xerces.parsers.DOMParser;
import org.xml.sax.SAXException;
import java.io.IOException;

public class XercesChecker {

    public static void main(String[] args) {
        if (args.length <= 0) {
            System.out.println("Usage: java XercesChecker URL");
            return;
        }
        String document = args[0];

        // Instantiate the Xerces parser class directly instead of using JAXP
        DOMParser parser = new DOMParser();
        try {
            parser.parse(document);
            System.out.println(document + " is well-formed.");
        }
        catch (SAXException e) {
            System.out.println(document + " is not well-formed.");
        }
        catch (IOException e) {
            System.out.println(
                "Due to an IOException, the parser could not check "
                + document
            );
        }
    }

}

Before we traverse the DOM, it is useful to have a class that prints the properties of each node so that we know which node we are currently visiting.

PropertyPrinter is a simple utility class that accepts a Node as an argument and prints out the values of its non-null properties. We’ll be using this class shortly in another program.

Example 6: PropertyPrinter.java

import org.w3c.dom.*;
import java.io.*;

public class PropertyPrinter {

    private Writer out;

    public PropertyPrinter(Writer out) {
        if (out == null) {
            throw new NullPointerException("Writer must be non-null.");
        }
        this.out = out;
    }

    public PropertyPrinter() {
        this(new OutputStreamWriter(System.out));
    }

    private int nodeCount = 0;

    public void writeNode(Node node) throws IOException {
        if (node == null) {
            throw new NullPointerException("Node must be non-null.");
        }
        if (node.getNodeType() == Node.DOCUMENT_NODE
                || node.getNodeType() == Node.DOCUMENT_FRAGMENT_NODE) {
            // starting a new document, reset the node count
            nodeCount = 1;
        }

        String name = node.getNodeName(); // never null
        // NodeTyper is a helper class that is not listed in this tutorial
        String type = NodeTyper.getTypeName(node); // never null
        String localName = node.getLocalName();
        String uri = node.getNamespaceURI();
        String prefix = node.getPrefix();
        String value = node.getNodeValue();

        StringBuilder result = new StringBuilder();
        result.append("Node " + nodeCount + ":\r\n");
        result.append(" Type: " + type + "\r\n");
        result.append(" Name: " + name + "\r\n");
        if (localName != null) {
            result.append(" Local Name: " + localName + "\r\n");
        }
        if (prefix != null) {
            result.append(" Prefix: " + prefix + "\r\n");
        }
        if (uri != null) {
            result.append(" Namespace URI: " + uri + "\r\n");
        }
        if (value != null) {
            result.append(" Value: " + value + "\r\n");
        }

        out.write(result.toString());
        out.write("\r\n");
        out.flush();
        nodeCount++;
    }

}
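PropertyPrinter calls NodeTyper.getTypeName(), but that helper is not listed here. A plausible reconstruction, offered only as a sketch, maps the constant returned by getNodeType() to a readable string. The names "Document", "Element", "Text", "Comment", and "Processing Instruction" match the program output shown later in this tutorial; the remaining names are guesses.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class NodeTyper {

    // Maps a node's type code to a human-readable name.
    // Only five of these names are confirmed by the tutorial's output;
    // the others are assumptions.
    static String getTypeName(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:                return "Element";
            case Node.ATTRIBUTE_NODE:              return "Attribute";
            case Node.TEXT_NODE:                   return "Text";
            case Node.CDATA_SECTION_NODE:          return "CDATA Section";
            case Node.ENTITY_REFERENCE_NODE:       return "Entity Reference";
            case Node.ENTITY_NODE:                 return "Entity";
            case Node.PROCESSING_INSTRUCTION_NODE: return "Processing Instruction";
            case Node.COMMENT_NODE:                return "Comment";
            case Node.DOCUMENT_NODE:               return "Document";
            case Node.DOCUMENT_TYPE_NODE:          return "Document Type";
            case Node.DOCUMENT_FRAGMENT_NODE:      return "Document Fragment";
            case Node.NOTATION_NODE:               return "Notation";
            default:                               return "Unknown";
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a few nodes in memory instead of parsing a file
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        System.out.println(getTypeName(doc));
        System.out.println(getTypeName(doc.createElement("value")));
        System.out.println(getTypeName(doc.createTextNode("PTT")));
    }
}
```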

3.2 Traversing the tree

You will learn how to navigate the tree by finding the parent, first child, last child, previous and next siblings, and attributes of any node. Since not all nodes have children, you should test for their presence with hasChildNodes() before calling the getFirstChild() and getLastChild() methods. You should also be prepared for any of these methods to return null in the event that the requested node doesn’t exist. Similarly, you should check hasAttributes() before calling the getAttributes() method.
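As a small illustration of these checks, the sketch below builds a one-line summary of a node, guarding each navigation call as described. The class name Navigation, the describe() helper, and the sample order document are hypothetical and exist only for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class Navigation {

    // Hypothetical sample document
    static final String SAMPLE =
        "<order id=\"7\"><item>pen</item><item>ink</item></order>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Summarizes a node's attributes, children, parent, and next sibling.
    static String describe(Node node) {
        StringBuilder sb = new StringBuilder(node.getNodeName());
        if (node.hasAttributes()) {
            sb.append(" attrs=").append(node.getAttributes().getLength());
        }
        if (node.hasChildNodes()) {
            sb.append(" first=").append(node.getFirstChild().getNodeName());
            sb.append(" last=").append(node.getLastChild().getNodeName());
        }
        // getParentNode() and getNextSibling() return null when nothing is there
        Node parent = node.getParentNode();
        sb.append(" parent=").append(parent == null ? "none" : parent.getNodeName());
        Node next = node.getNextSibling();
        sb.append(" next=").append(next == null ? "none" : next.getNodeName());
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        Node root = doc.getDocumentElement();
        System.out.println(describe(doc));
        System.out.println(describe(root));
        System.out.println(describe(root.getFirstChild()));
    }
}
```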

TreeReporter is a simple program that recursively traverses the tree in preorder. As each node is visited, its name and value are printed using the PropertyPrinter class from the previous section. Once again, Node is the only DOM type used. That’s the power of polymorphism: you can do quite a lot without knowing exactly what it is you’re doing it to.

Example 7: TreeReporter.java

import javax.xml.parsers.*; // JAXP
import org.w3c.dom.Node;
import org.xml.sax.SAXException;
import java.io.IOException;

public class TreeReporter {

    public static void main(String[] args) {
        if (args.length <= 0) {
            System.out.println("Usage: java TreeReporter URL");
            return;
        }

        TreeReporter iterator = new TreeReporter();
        try {
            // Use JAXP to find a parser
            DocumentBuilderFactory factory
                = DocumentBuilderFactory.newInstance();
            // Turn on namespace support
            factory.setNamespaceAware(true);
            DocumentBuilder parser = factory.newDocumentBuilder();
            // Read the entire document into memory
            Node document = parser.parse(args[0]);
            // Process it starting at the root
            iterator.followNode(document);
        }
        catch (SAXException e) {
            System.out.println(args[0] + " is not well-formed.");
            System.out.println(e.getMessage());
        }
        catch (IOException e) {
            System.out.println(e);
        }
        catch (ParserConfigurationException e) {
            System.out.println("Could not locate a JAXP parser");
        }
    } // end main

    private PropertyPrinter printer = new PropertyPrinter();

    // note use of recursion: visit the node, then its first child,
    // then its next sibling
    public void followNode(Node node) throws IOException {
        printer.writeNode(node);
        if (node.hasChildNodes()) {
            Node firstChild = node.getFirstChild();
            followNode(firstChild);
        }
        Node nextNode = node.getNextSibling();
        if (nextNode != null) followNode(nextNode);
    }

}

Here’s the output produced by running the TreeReporter program:

C:\>java TreeReporter ex2.xml
Node 1:
 Type: Document
 Name: #document

Node 2:
 Type: Element
 Name: db:para
 Local Name: para
 Prefix: db
 Namespace URI:

Node 3:
 Type: Text
 Name: #text
 Value:
 or consider this

Node 4:
 Type: Element
 Name: markup
 Local Name: markup
 Namespace URI:

Node 5:
 Type: Text
 Name: #text
 Value: para

Node 6:
 Type: Text
 Name: #text
 Value: element:

3.3 Manipulating DOM

The Node interface has four methods that change the tree by inserting, removing, replacing, and appending children at points specified by nodes in the tree:

public Node insertBefore(Node toBeInserted, Node toBeInsertedBefore)
    throws DOMException;
public Node replaceChild(Node toBeInserted, Node toBeReplaced)
    throws DOMException;
public Node removeChild(Node toBeRemoved)
    throws DOMException;
public Node appendChild(Node toBeAppended)
    throws DOMException;

All four of these methods throw a DOMException if you try to use them to make a document malformed; for instance, by removing the root element or appending a child to a text node. Each method also returns a node: insertBefore() and appendChild() return the node that was added, while replaceChild() and removeChild() return the node that was taken out of the tree.
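To see the four methods working together, the sketch below builds a tiny tree in memory and edits the children of its root element. It is an illustration under assumptions: the class name ChildEdits, the element names a through d, and the helper methods are invented for this example. The second method shows a DOMException being raised when a second root element is appended to a document.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class ChildEdits {

    static Document newDocument() throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
    }

    // Joins the names of a node's children with commas.
    static String childNames(Node parent) {
        StringBuilder names = new StringBuilder();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (names.length() > 0) names.append(',');
            names.append(n.getNodeName());
        }
        return names.toString();
    }

    // Applies all four methods and reports the final children and return values.
    static String editChildren() throws Exception {
        Document doc = newDocument();
        Element root = doc.createElement("root");
        doc.appendChild(root);
        Element a = doc.createElement("a");
        Element b = doc.createElement("b");
        Element c = doc.createElement("c");
        Element d = doc.createElement("d");

        root.appendChild(b);                     // children: b
        root.insertBefore(a, b);                 // children: a,b
        root.appendChild(c);                     // children: a,b,c
        Node replaced = root.replaceChild(d, b); // children: a,d,c
        Node removed = root.removeChild(c);      // children: a,d
        return childNames(root)
             + " replaced=" + replaced.getNodeName()
             + " removed=" + removed.getNodeName();
    }

    // A document may hold only one element child, so this append is refused.
    static boolean rejectsSecondRoot() throws Exception {
        Document doc = newDocument();
        doc.appendChild(doc.createElement("root"));
        try {
            doc.appendChild(doc.createElement("extra"));
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.HIERARCHY_REQUEST_ERR;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(editChildren());
        System.out.println(rejectsSecondRoot());
    }
}
```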

Restructurer is a program that moves all processing instruction nodes from inside the root element to before the root element and all comment nodes from inside the root element to after the root element. For example, consider this document:

Example 8: An XML Document for Manipulating

<?xml version="1.0"?>
<document>
  Some data
  <!--comment-->
  <?processing instruction ?>
</document>

Example 9: Restructurer.java

import javax.xml.parsers.*;
import org.w3c.dom.*;
import org.xml.sax.SAXException;
import java.io.IOException;

public class Restructurer {

    // Since this method only operates on its argument and does
    // not interact with any fields in the class, it's
    // plausibly made static.
    public static void processNode(Node current) throws DOMException {

        // Store a reference to the current node's next sibling
        // before the node is removed from the tree, after which
        // it no longer has a sibling.
        Node nextSibling = current.getNextSibling();
        int nodeType = current.getNodeType();

        if (nodeType == Node.COMMENT_NODE
                || nodeType == Node.PROCESSING_INSTRUCTION_NODE) {
            Node document = current.getOwnerDocument();
            // Find the root element by looping through the children of
            // the document until we find the only one that's an
            // element node. There's a quicker way to do this once we
            // learn more about the Document class in the next chapter.
            Node root = document.getFirstChild();
            while (root.getNodeType() != Node.ELEMENT_NODE) {
                root = root.getNextSibling();
            }
            Node parent = current.getParentNode();
            parent.removeChild(current);
            if (nodeType == Node.COMMENT_NODE) {
                document.appendChild(current);
            }
            else if (nodeType == Node.PROCESSING_INSTRUCTION_NODE) {
                document.insertBefore(current, root);
            }
        }
        else if (current.hasChildNodes()) {
            Node firstChild = current.getFirstChild();
            processNode(firstChild);
        }

        if (nextSibling != null) {
            processNode(nextSibling);
        }
    }

    public static void main(String[] args) {
        if (args.length <= 0) {
            System.out.println("Usage: java Restructurer URL");
            return;
        }

        Restructurer rc = new Restructurer();
        try {
            // Use JAXP to find a parser
            DocumentBuilderFactory factory
                = DocumentBuilderFactory.newInstance();
            DocumentBuilder parser = factory.newDocumentBuilder();
            Node document = parser.parse(args[0]);

            System.out.println("Before restructuring");
            rc.followNode(document);

            // Process it starting at the root
            Restructurer.processNode(document);

            System.out.println("After restructuring");
            rc.followNode(document);
        }
        catch (SAXException e) {
            System.out.println(args[0] + " is not well-formed.");
            System.out.println(e.getMessage());
        }
        catch (IOException e) {
            System.out.println(e);
        }
        catch (ParserConfigurationException e) {
            System.out.println("Could not locate a JAXP parser");
        }
    } // end main

    private PropertyPrinter printer = new PropertyPrinter();

    // note use of recursion
    public void followNode(Node node) throws IOException {
        printer.writeNode(node);
        if (node.hasChildNodes()) {
            Node firstChild = node.getFirstChild();
            followNode(firstChild);
        }
        Node nextNode = node.getNextSibling();
        if (nextNode != null) followNode(nextNode);
    }

}

This program walks the tree, calling the removeChild() method every time a comment or processing instruction node is spotted, and then inserting the processing instruction nodes before the root element with insertBefore() and the comment nodes after the root element with appendChild(). References to the document node, the root element node, and the parent of the current node have to be available whenever a node is moved. The Document object is modified in place.
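The comment in processNode() mentions a quicker way to find the root element. The standard Document interface offers getDocumentElement() for this purpose, and the sketch below uses it in a compact variant of the same restructuring. This is not the Restructurer listing itself: the class name RootLookup, the helper methods, and the single-line sample document are assumptions made so that the result can be checked in memory.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class RootLookup {

    // A single-line variant of Example 8 (no white space between nodes)
    static final String SAMPLE =
        "<document>Some data<!--comment--><?processing instruction?></document>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Moves processing instructions before the root and comments after it.
    static void restructure(Document doc) {
        Element root = doc.getDocumentElement(); // no sibling loop needed
        moveChildren(root, doc, root);
    }

    private static void moveChildren(Node current, Document doc, Element root) {
        Node child = current.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before detaching
            short type = child.getNodeType();
            if (type == Node.COMMENT_NODE) {
                current.removeChild(child);
                doc.appendChild(child);
            } else if (type == Node.PROCESSING_INSTRUCTION_NODE) {
                current.removeChild(child);
                doc.insertBefore(child, root);
            } else {
                moveChildren(child, doc, root);
            }
            child = next;
        }
    }

    // Joins the names of the document node's children with commas.
    static String topLevelNames(Document doc) {
        StringBuilder names = new StringBuilder();
        for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (names.length() > 0) names.append(',');
            names.append(n.getNodeName());
        }
        return names.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println("Before: " + topLevelNames(doc));
        restructure(doc);
        System.out.println("After:  " + topLevelNames(doc));
    }
}
```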

The output of the Restructurer program looks like this:

Before restructuring
Node 1:
 Type: Document
 Name: #document

Node 2:
 Type: Element
 Name: document

Node 3:
 Type: Text
 Name: #text
 Value:
 Some data

Node 4:
 Type: Comment
 Name: #comment
 Value: comment

Node 5:
 Type: Text
 Name: #text
 Value:

Node 6:
 Type: Processing Instruction
 Name: processing
 Value: instruction

Node 7:
 Type: Text
 Name: #text
 Value:

After restructuring

Node 1: