Java XML Introduction
An overview of XML processing in Java — DOM, SAX, StAX, and JAXB approaches.
XML (Extensible Markup Language) is a text format for representing structured, hierarchical data using nested tags. Long before JSON dominated web APIs, XML was the default for configuration files, document formats, and message exchange — and it is still everywhere, from Maven pom.xml files to SOAP services to Office documents.
Java has rich, built-in XML support in the JDK: you do not need any external library to read or write XML. The javax.xml.parsers and org.w3c.dom packages, plus org.xml.sax and javax.xml.stream, give you three distinct parsing models. This chapter maps out what XML is and which model to reach for.
What XML looks like
An XML document is a tree of elements. Each element has a name, optional attributes, and either text content or nested child elements. There is always exactly one root element wrapping everything.
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <book id="1" lang="en">
    <title>Effective Java</title>
    <price>45.00</price>
  </book>
</catalog>
Here <catalog> is the root, <book> is a child element with two attributes (id and lang), and <title> and <price> carry text. The XML declaration on the first line states the version and character encoding. Well-formed XML requires every opening tag to be closed and properly nested.
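To make the well-formedness rule concrete, here is a minimal sketch that asks the JDK parser whether a string is well-formed. The class and method names (WellFormedCheck, isWellFormed) are illustrative, not part of any API; the approach simply treats a SAXException from the parser as "not well-formed".

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {

    // Hypothetical helper: true when the parser accepts the text as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Unclosed tags, bad nesting, or a second root element end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<catalog><book id=\"1\"/></catalog>")); // true
        System.out.println(isWellFormed("<catalog><book></catalog>"));           // false
        System.out.println(isWellFormed("<a/><b/>"));                            // false
    }
}
```

The second input fails because <book> is never closed, and the third because it has two top-level elements instead of a single root.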
The three parsing models
The JDK offers three ways to read XML, each with a different trade-off between convenience and memory. Picking the right one is the most important XML decision you will make.
| Model | Style | Memory | Best for |
|---|---|---|---|
| DOM | Loads the whole tree into memory | High | Random access, editing, small/medium docs |
| SAX | Push events as it scans (callbacks) | Low | Large docs, read-only streaming |
| StAX | Pull events on demand (cursor) | Low | Large docs, with simpler control flow |
DOM builds a complete in-memory tree you can navigate freely and modify. SAX fires callbacks (startElement, characters, endElement) as it reads, never holding the whole document. StAX is also streaming but lets your code pull the next event when ready, which is usually easier to follow than SAX callbacks.
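The three SAX callbacks named above work together: startElement tells you where you are, characters delivers text, and endElement tells you when an element is complete. The sketch below collects every <title> using all three. SaxTitleCollector and titles are hypothetical names chosen for illustration, and the document is parsed from a string to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxTitleCollector {

    // Hypothetical helper: returns the text of every <title> element, in document order.
    static List<String> titles(String xml) {
        List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside a <title>

            @Override
            public void startElement(String uri, String local, String qName, Attributes attrs) {
                if ("title".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks, so append.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("title".equals(qName)) {
                    found.add(text.toString());
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<catalog><book id=\"1\"><title>Effective Java</title>"
                + "<price>45.00</price></book></catalog>";
        System.out.println(titles(xml)); // [Effective Java]
    }
}
```

Note that the handler has to track its own state (the text field) between callbacks, which is the bookkeeping that StAX's pull style tends to reduce.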
DOM: the in-memory tree
DOM is the most convenient model when documents are small enough to fit in memory. You parse once, then walk or query the tree as often as you like.
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
Document doc = factory.newDocumentBuilder().parse("catalog.xml");
NodeList books = doc.getElementsByTagName("book");
System.out.println("Books: " + books.getLength());
getElementsByTagName returns a live NodeList; you index into it and cast nodes to Element to read attributes and child text.
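The indexing and casting step might look like the following sketch. DomBookReader and describeBook are hypothetical names, and the XML is supplied as a string rather than read from catalog.xml so the example runs on its own.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomBookReader {

    // Hypothetical helper: describes the book at the given index as "id: title".
    static String describeBook(String xml, int index) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        NodeList books = doc.getElementsByTagName("book");
        // item() returns a Node; attribute access needs the Element subtype.
        Element book = (Element) books.item(index);
        String id = book.getAttribute("id");
        // Search within this book only, then take the text of its first <title>.
        String title = book.getElementsByTagName("title").item(0).getTextContent();
        return id + ": " + title;
    }

    public static void main(String[] args) {
        String xml = "<catalog><book id=\"1\" lang=\"en\"><title>Effective Java</title>"
                + "<price>45.00</price></book></catalog>";
        System.out.println(describeBook(xml, 0)); // 1: Effective Java
    }
}
```

The cast is safe here because getElementsByTagName only ever returns element nodes; when walking getChildNodes instead, check the node type first, since whitespace between tags shows up as text nodes.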
SAX and StAX: streaming
When a document is too large to hold in memory, you stream it. SAX pushes events to a handler you supply:
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
DefaultHandler handler = new DefaultHandler() {
    @Override
    public void startElement(String uri, String local, String name, Attributes a) {
        System.out.println("Start: " + name);
    }
};
SAXParserFactory.newInstance().newSAXParser()
    .parse("catalog.xml", handler);
StAX gives you a cursor you advance yourself, which many find clearer:
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamConstants;
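import java.io.FileInputStream;
// Passing an InputStream (not a Reader) lets the parser detect the encoding
// named in the XML declaration instead of relying on the platform default.
XMLStreamReader r = XMLInputFactory.newInstance()
    .createXMLStreamReader(new FileInputStream("catalog.xml"));
while (r.hasNext()) {
    if (r.next() == XMLStreamConstants.START_ELEMENT) {
        System.out.println("Start: " + r.getLocalName());
    }
}
A self-contained example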
The runnable example below uses only JDK classes — no Jackson or JAXB needed. It parses an XML catalog from an in-memory string with DOM, walks the <book> elements, reads attributes and child text, and sums the prices.
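A minimal sketch of such a program follows, under stated assumptions: the class name CatalogDemo is hypothetical, and the second book and its price are illustrative values chosen so that the catalog has two books and a total of $83.50, matching the takeaways listed next. Prices are summed with BigDecimal to avoid floating-point rounding.

```java
import java.io.IOException;
import java.io.StringReader;
import java.math.BigDecimal;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class CatalogDemo {

    // Assumed sample data: the second <book> is illustrative, not from a real catalog.
    static final String XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<catalog>"
            + "<book id=\"1\" lang=\"en\"><title>Effective Java</title><price>45.00</price></book>"
            + "<book id=\"2\" lang=\"en\"><title>Sample Second Book</title><price>38.50</price></book>"
            + "</catalog>";

    // Parses XML text into a DOM tree using only JDK classes.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Reads the first child element with the given tag name as text.
    static String childText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }

    // Sums the <price> of every <book> in the document.
    static BigDecimal total(Document doc) {
        NodeList books = doc.getElementsByTagName("book");
        BigDecimal sum = BigDecimal.ZERO;
        for (int i = 0; i < books.getLength(); i++) {
            Element book = (Element) books.item(i);
            sum = sum.add(new BigDecimal(childText(book, "price")));
        }
        return sum;
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        System.out.println("Root: " + doc.getDocumentElement().getTagName());
        NodeList books = doc.getElementsByTagName("book");
        System.out.println("Books: " + books.getLength());
        for (int i = 0; i < books.getLength(); i++) {
            Element book = (Element) books.item(i);
            System.out.println(book.getAttribute("id") + " | "
                    + childText(book, "title") + " | " + childText(book, "price"));
        }
        System.out.println("Total: $" + total(doc)); // Total: $83.50
    }
}
```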
What to take from the run:
- DOM parsing needs no external dependency: DocumentBuilderFactory and org.w3c.dom ship with the JDK, which is why the program prints results with nothing on the classpath.
- The root element name printed as catalog confirms there is exactly one root wrapping the whole document.
- getElementsByTagName("book") returned a NodeList of length 2, so you index it like a list and cast each item to Element.
- Attributes (id) are read with getAttribute, while text content (title, price) is read with getTextContent; they are different kinds of data on the same element.
- Because the entire tree is in memory, summing prices across all books to $83.50 is just a loop with random access, the convenience that makes DOM worth its memory cost.
Practice
Which XML parsing model loads the entire document into memory as a navigable tree?