W3docs

Java XML Introduction

An overview of XML processing in Java — DOM, SAX, StAX, and JAXB approaches.

XML (Extensible Markup Language) is a text format for representing structured, hierarchical data using nested tags. Long before JSON dominated web APIs, XML was the default for configuration files, document formats, and message exchange — and it is still everywhere, from Maven pom.xml files to SOAP services to Office documents.

Java has rich XML support built into the JDK, so you do not need an external library to read or write XML. Three parsing models are available: DOM (javax.xml.parsers with org.w3c.dom), SAX (javax.xml.parsers with org.xml.sax), and StAX (javax.xml.stream). This chapter maps out what XML is and which model to reach for.

What XML looks like

An XML document is a tree of elements. Each element has a name, optional attributes, and either text content or nested child elements. There is always exactly one root element wrapping everything.

<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <book id="1" lang="en">
    <title>Effective Java</title>
    <price>45.00</price>
  </book>
</catalog>

Here <catalog> is the root, <book> is a child element with two attributes (id and lang), and <title> and <price> carry text. The XML declaration on the first line states the version and character encoding. Well-formed XML requires every opening tag to be closed and properly nested.
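The well-formedness rules can be checked with the JDK's own parser. The sketch below is one possible way to do it, not the only one: it streams the text through a SAX parser with a do-nothing handler and treats any SAXException as "not well-formed". The class name WellFormedCheck and the method isWellFormed are illustrative names, not part of any library.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for this example only.
public class WellFormedCheck {

    // Returns true if the text parses as well-formed XML, false otherwise.
    public static boolean isWellFormed(String xml) {
        try {
            // DefaultHandler ignores all content events and rethrows fatal errors.
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Unclosed tags, bad nesting, or a second root element end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Properly nested, single root.
        System.out.println(isWellFormed("<catalog><book id=\"1\"><title>Effective Java</title></book></catalog>"));
        // <title> and <book> are closed in the wrong order.
        System.out.println(isWellFormed("<catalog><book><title>Effective Java</book></title></catalog>"));
        // Two root elements.
        System.out.println(isWellFormed("<book/><book/>"));
    }
}
```

Only the first call should report true; the other two break the nesting rule and the single-root rule respectively.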

The three parsing models

The JDK offers three ways to read XML, each with a different trade-off between convenience and memory. The choice matters because it determines how much memory a parse needs and how your reading code is structured.

Model | Style                               | Memory | Best for
DOM   | Loads the whole tree into memory    | High   | Random access, editing, small/medium docs
SAX   | Push events as it scans (callbacks) | Low    | Large docs, read-only streaming
StAX  | Pull events on demand (cursor)      | Low    | Large docs, with simpler control flow

DOM builds a complete in-memory tree you can navigate freely and modify. SAX fires callbacks (startElement, characters, endElement) as it reads, never holding the whole document. StAX is also streaming but lets your code pull the next event when ready, which is usually easier to follow than SAX callbacks.

DOM: the in-memory tree

DOM is the most convenient model when documents are small enough to fit in memory. You parse once, then walk or query the tree as often as you like.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

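// The factory creates the parser; parse(String) treats its argument as a URI.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
Document doc = factory.newDocumentBuilder().parse("catalog.xml");

// Collect every <book> element in document order.
NodeList books = doc.getElementsByTagName("book");
System.out.println("Books: " + books.getLength());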

getElementsByTagName returns a live NodeList; you index into it and cast nodes to Element to read attributes and child text.
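To make "live" concrete, here is a small sketch that casts an item to Element, reads an attribute, and then removes that element from the tree. The class name, the method name, and the second book in the sample data are assumptions made for this illustration. The same NodeList object is queried before and after the removal.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical demo class, named for this example only.
public class LiveNodeListDemo {

    // Sample data; the second book is invented for illustration.
    static final String XML =
        "<catalog>"
        + "<book id=\"1\" lang=\"en\"><title>Effective Java</title></book>"
        + "<book id=\"2\" lang=\"en\"><title>Another Book</title></book>"
        + "</catalog>";

    // Removes the first <book> and returns the list length before and after.
    public static int[] lengthsBeforeAndAfterRemoval() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
            NodeList books = doc.getElementsByTagName("book");
            int before = books.getLength();

            // item(i) returns a Node; cast to Element to reach attributes.
            Element first = (Element) books.item(0);
            System.out.println("Removing book id=" + first.getAttribute("id"));
            first.getParentNode().removeChild(first);

            // Same NodeList object, asked again after the tree changed.
            return new int[] { before, books.getLength() };
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        int[] lengths = lengthsBeforeAndAfterRemoval();
        System.out.println("Before: " + lengths[0] + ", after: " + lengths[1]);
    }
}
```

Because the list tracks the document, code that removes nodes while looping over a NodeList by index should iterate backwards or copy the nodes into a regular List first.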

SAX and StAX: streaming

When a document is too large to hold in memory, you stream it. SAX pushes events to a handler you supply:

import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;

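DefaultHandler handler = new DefaultHandler() {
    // Called once per opening tag; name is the qualified element name.
    @Override
    public void startElement(String uri, String local, String name, Attributes a) {
        System.out.println("Start: " + name);
    }
};
// The parser drives the loop and invokes the handler as it scans the file.
SAXParserFactory.newInstance().newSAXParser()
    .parse("catalog.xml", handler);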
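Printing element names is the easy case. Extracting text with SAX means tracking state across callbacks, because the parser never hands you a finished element. The following is a minimal sketch of that pattern, assuming you want the text of every <title> element; SaxTitles and TitleHandler are illustrative names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, named for this sketch only.
public class SaxTitles {

    // Remembers whether the parser is inside <title> and buffers its text.
    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder text; // non-null only while inside <title>

        @Override
        public void startElement(String uri, String local, String name, Attributes a) {
            if ("title".equals(name)) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text node in several chunks, so append.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String local, String name) {
            if ("title".equals(name)) {
                titles.add(text.toString().trim());
                text = null;
            }
        }
    }

    // Returns the text of every <title> element in document order.
    public static List<String> titles(String xml) {
        try {
            TitleHandler handler = new TitleHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.titles;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<catalog><book id=\"1\"><title>Effective Java</title>"
            + "<price>45.00</price></book></catalog>";
        System.out.println(titles(xml));
    }
}
```

The handler holds only the current title's text at any moment, which is what keeps memory use flat regardless of document size.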

StAX gives you a cursor you advance yourself, which many find clearer:

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamConstants;
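import java.io.FileInputStream;
import java.io.InputStream;

// Pass a byte stream so the parser can honor the encoding named in the XML
// declaration; a FileReader would decode with the default charset instead.
try (InputStream in = new FileInputStream("catalog.xml")) {
    XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
    while (r.hasNext()) {
        // next() advances the cursor and returns the type of the new event.
        if (r.next() == XMLStreamConstants.START_ELEMENT) {
            System.out.println("Start: " + r.getLocalName());
        }
    }
    r.close(); // releases parser resources; it does not close the stream
}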
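The pull style pays off when you need data from inside an element, since your own loop decides when to read further. As a sketch under stated assumptions (StaxTitles and titlesById are illustrative names, and every <title> is assumed to sit inside a <book>), this maps each book's id attribute to its title:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, named for this sketch only.
public class StaxTitles {

    // Maps each book's id attribute to its title, in document order.
    public static Map<String, String> titlesById(String xml) {
        Map<String, String> result = new LinkedHashMap<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            String currentId = null;
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT) {
                    continue; // skip text, end tags, comments, and so on
                }
                if ("book".equals(r.getLocalName())) {
                    // A null namespace URI means "match the name in any namespace".
                    currentId = r.getAttributeValue(null, "id");
                } else if ("title".equals(r.getLocalName())) {
                    // getElementText reads up to the matching end tag.
                    result.put(currentId, r.getElementText());
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<catalog><book id=\"1\" lang=\"en\"><title>Effective Java</title>"
            + "<price>45.00</price></book></catalog>";
        System.out.println(titlesById(xml));
    }
}
```

Compared with the SAX approach, the state here lives in ordinary local variables inside one loop rather than in fields shared between callbacks.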

A self-contained example

The runnable example below uses only JDK classes — no Jackson or JAXB needed. It parses an XML catalog from an in-memory string with DOM, walks the <book> elements, reads attributes and child text, and sums the prices.
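A minimal sketch of such a program follows. It matches the description above, but the details are assumptions: the class name CatalogDemo is illustrative, and the second book with its 38.50 price is hypothetical sample data chosen so the two prices total 83.50.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical example class, named for this sketch only.
public class CatalogDemo {

    // Assumed sample data: the second book and its price are invented.
    public static final String XML =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        + "<catalog>"
        + "<book id=\"1\" lang=\"en\"><title>Effective Java</title><price>45.00</price></book>"
        + "<book id=\"2\" lang=\"en\"><title>Sample Title</title><price>38.50</price></book>"
        + "</catalog>";

    // Parses an in-memory XML string into a DOM tree.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Walks every <book>, prints its id, title, and price, and returns the price sum.
    public static double totalPrice(Document doc) {
        NodeList books = doc.getElementsByTagName("book");
        double total = 0;
        for (int i = 0; i < books.getLength(); i++) {
            Element book = (Element) books.item(i);
            String id = book.getAttribute("id"); // attribute value
            String title = book.getElementsByTagName("title").item(0).getTextContent();
            double price = Double.parseDouble(
                book.getElementsByTagName("price").item(0).getTextContent());
            System.out.println(id + ": " + title + " (" + price + ")");
            total += price;
        }
        return total;
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        System.out.println("Root: " + doc.getDocumentElement().getNodeName());
        System.out.println("Books: " + doc.getElementsByTagName("book").getLength());
        System.out.println(String.format(Locale.ROOT, "Total: $%.2f", totalPrice(doc)));
    }
}
```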

What to take from the run:

  • DOM parsing needs no external dependency — DocumentBuilderFactory and org.w3c.dom ship with the JDK, which is why the program prints results with nothing on the classpath.
  • The root element name printed as catalog confirms there is exactly one root wrapping the whole document.
  • getElementsByTagName("book") returned a NodeList of length 2, so you index it like a list and cast each item to Element.
  • Attributes (id) are read with getAttribute, while text content (title, price) is read with getTextContent — they are different kinds of data on the same element.
  • Because the entire tree is in memory, summing prices across all books to $83.50 is just a loop with random access — the convenience that makes DOM worth its memory cost.

Practice

Which XML parsing model loads the entire document into memory as a navigable tree?