W3docs

Java XML SAX Parser

Stream large XML documents in Java with the event-driven SAX parser.

SAX (Simple API for XML) is the JDK's event-driven, streaming XML parser. Instead of building a tree in memory the way DOM does, SAX reads the document once from start to finish and pushes events at you ("element started", "text seen", "element ended"), which you handle as they go past. Because the parser never holds the whole document, its memory use stays small and roughly constant regardless of file size; the only thing that grows is the state your handler chooses to keep. The interfaces live in org.xml.sax, and parsers are created through javax.xml.parsers.SAXParserFactory. Both are part of the standard JDK, with nothing to install.

Push parsing vs. building a tree

A DOM parser reads the entire document and hands you a navigable Document object — convenient, but it must fit every node in memory. SAX flips the control: the parser drives, calling methods on your handler as it encounters each piece of markup. You keep only the state you care about. The trade-off is that you cannot move backwards or look ahead — you see each event exactly once, in document order.

| Aspect | SAX | DOM |
| --- | --- | --- |
| Memory | Constant, independent of file size | Proportional to document size |
| Model | Push: parser calls your callbacks | Tree: you walk the loaded tree |
| Navigation | Forward only, single pass | Random access, in any direction |
| Modification | Read only | Read and write |
| Best for | Huge files, extracting a subset | Small/medium docs you need to edit |
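
To make the comparison concrete, here is a minimal sketch that counts <book> elements both ways. The class name CountBooks, its method names, and the choice of a "book" element are illustrative assumptions; only the JDK's DOM and SAX classes are used.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class CountBooks {

  // DOM: the whole document is loaded into a tree, then queried.
  static int countWithDom(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return doc.getElementsByTagName("book").getLength();
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("DOM parse failed", e);
    }
  }

  // SAX: no tree is built; the handler keeps a single counter.
  static int countWithSax(String xml) {
    int[] count = {0}; // one-element array so the anonymous class can update it
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName, Attributes attr) {
        if ("book".equals(qName)) count[0]++;
      }
    };
    try {
      SAXParserFactory.newInstance()
          .newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("SAX parse failed", e);
    }
    return count[0];
  }

  public static void main(String[] args) {
    String xml = "<catalog><book/><book/><book/></catalog>";
    System.out.println("DOM count: " + countWithDom(xml));
    System.out.println("SAX count: " + countWithSax(xml));
  }
}
```

Both methods give the same answer; the difference is what sits in memory while they work. The DOM version holds every node of the document, while the SAX version holds one integer.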

The factory and the handler

Two types do almost all the work. SAXParserFactory creates a SAXParser, and you subclass DefaultHandler to receive the events. DefaultHandler implements every callback as a no-op, so you override only the ones you need:

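import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Note: newSAXParser() and parse() throw checked exceptions
// (ParserConfigurationException, SAXException, IOException).
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);          // optional: report namespace URIs
SAXParser parser = factory.newSAXParser();

DefaultHandler handler = new DefaultHandler() {
  @Override
  public void startElement(String uri, String localName, String qName, Attributes attr) {
    System.out.println("start <" + qName + ">");   // qName is the name as written, prefix included
  }
};
parser.parse(new File("data.xml"), handler);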
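
The effect of setNamespaceAware is easiest to see by looking at what startElement receives in each mode. The sketch below is an illustration under assumptions: the class NamespaceNames, the method firstElementNames, and the urn:example:shop namespace are all made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class NamespaceNames {

  // Returns "uri|localName|qName" as reported for the document's first element.
  static String firstElementNames(String xml, boolean namespaceAware) {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(namespaceAware);
    String[] seen = {null};
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName, Attributes attr) {
        if (seen[0] == null) seen[0] = uri + "|" + localName + "|" + qName;
      }
    };
    try {
      factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("parse failed", e);
    }
    return seen[0];
  }

  public static void main(String[] args) {
    String xml = "<p:item xmlns:p=\"urn:example:shop\"/>";
    System.out.println("aware    : " + firstElementNames(xml, true));
    System.out.println("not aware: " + firstElementNames(xml, false));
  }
}
```

With namespace awareness on, uri carries the namespace URI and localName the name without its prefix. With it off (the SAXParserFactory default), only qName is reliable, and uri and localName are typically empty strings, which is why the other snippets in this article compare against qName.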

The core callbacks

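These are the callbacks you override most often. Most belong to ContentHandler; error() and fatalError() belong to ErrorHandler. DefaultHandler implements both interfaces, so all of them are available to override:

| Callback | Fired when |
| --- | --- |
| startDocument() / endDocument() | The parse begins / ends |
| startElement(uri, localName, qName, attr) | An opening tag is read; attr holds its attributes |
| endElement(uri, localName, qName) | A closing tag is read |
| characters(ch, start, length) | Text content is read, possibly in several chunks |
| error() / fatalError() | A recoverable error (such as a validation failure) / a well-formedness violation is reported |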

Two facts trip up beginners. First, characters is not guaranteed to deliver all of an element's text in one call. The parser may split it, so you accumulate into a StringBuilder and read it on endElement. Second, attribute values are available only inside startElement, via the Attributes argument. Copy any value you need later, because the Attributes object is only valid for the duration of that call:

@Override
public void startElement(String uri, String localName, String qName, Attributes attr) {
  String id = attr.getValue("id");           // by name
  for (int i = 0; i < attr.getLength(); i++) // or by index
    System.out.println(attr.getQName(i) + "=" + attr.getValue(i));
}
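
The error callbacks listed in the table can be handled along the following lines. This is a sketch under assumptions: the class WellFormedCheck and its check method are invented for the example. It relies on two documented defaults of DefaultHandler: error() does nothing, and fatalError() rethrows the exception it is given.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class WellFormedCheck {

  // Returns "ok" if the document parses, otherwise a short description of the first problem.
  static String check(String xml) {
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void error(SAXParseException e) throws SAXException {
        throw e; // stop on recoverable errors too, instead of silently ignoring them
      }
      // fatalError is not overridden: the inherited version already rethrows.
    };
    try {
      SAXParserFactory.newInstance()
          .newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
      return "ok";
    } catch (SAXParseException e) {
      // SAXParseException carries the position at which the parser gave up.
      return "line " + e.getLineNumber() + ": " + e.getMessage();
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("parser could not run", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(check("<a><b/></a>"));   // well formed
    System.out.println(check("<a><b></a>"));    // <b> is never closed
  }
}
```

Any callbacks that fired before the error have already run, so a handler that accumulates results should treat them as incomplete when the parse fails.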

Tracking state across events

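Because SAX gives you no tree, you keep the context. A common pattern is a flag set on startElement and cleared on endElement, plus a text buffer that you reset at each element start and consume at element end. Resetting on every start tag works for elements that contain only text; text interleaved with child elements would need a buffer per open element: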

private final StringBuilder text = new StringBuilder();

@Override public void startElement(String u, String l, String q, Attributes a) {
  text.setLength(0);            // begin collecting fresh text
}
@Override public void characters(char[] ch, int start, int len) {
  text.append(ch, start, len);  // text may arrive in pieces
}
@Override public void endElement(String u, String l, String q) {
  if (q.equals("title")) System.out.println("title = " + text.toString().trim());
}
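
When the same element name appears in more than one place, a single flag is not enough and a stack of open element names is a common extension of the pattern. The sketch below is illustrative: the class BookTitles, the collect method, and the sample document are assumptions. It keeps a <title> only when its parent is a <book>.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
class BookTitles extends DefaultHandler {
  final List<String> titles = new ArrayList<>();
  private final Deque<String> open = new ArrayDeque<>();    // names of currently open elements
  private final StringBuilder text = new StringBuilder();

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attr) {
    open.push(qName);
    text.setLength(0);              // begin collecting fresh text
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    text.append(ch, start, length); // text may arrive in pieces
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    open.pop();                     // after the pop, peek() is the parent element
    if ("title".equals(qName) && "book".equals(open.peek())) {
      titles.add(text.toString().trim());
    }
  }

  static List<String> collect(String xml) {
    BookTitles handler = new BookTitles();
    try {
      SAXParserFactory.newInstance()
          .newSAXParser()
          .parse(new InputSource(new StringReader(xml)), handler);
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("parse failed", e);
    }
    return handler.titles;
  }

  public static void main(String[] args) {
    String xml = "<catalog><title>Spring list</title>"
        + "<book><title>Effective Java</title></book>"
        + "<book><title>Clean Code</title></book></catalog>";
    System.out.println(collect(xml));
  }
}
```

The stack grows only as deep as the document nests, so memory use stays independent of the document's length.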

A worked example: tallying a catalog without a tree

This program parses a small book catalog held in a text block. The handler counts books, counts how many are in stock (read from a stock attribute), and sums every price, all while the parser streams the document a single time. Only JDK classes are used.
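
A program matching that description might look like the sketch below. It is a reconstruction, not the original listing: the class name CatalogTally, the element names, the stock counts for b1 and b3, and the exact output format are assumptions, while the titles, the prices, and stock="0" on b2 follow the run notes in this section. Text blocks need Java 15 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical reconstruction of the worked example.
class CatalogTally extends DefaultHandler {
  // Sample data; stock values other than b2's "0" are assumed.
  static final String CATALOG = """
      <catalog>
        <book id="b1" stock="4"><title>Effective Java</title><price>45.00</price></book>
        <book id="b2" stock="0"><title>Clean Code</title><price>38.50</price></book>
        <book id="b3" stock="7"><title>Java Concurrency in Practice</title><price>52.00</price></book>
      </catalog>
      """;

  int books;
  int inStock;
  BigDecimal total = BigDecimal.ZERO;
  private String title = "";
  private final StringBuilder text = new StringBuilder();

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attr) {
    text.setLength(0);
    if ("book".equals(qName)) {
      books++;
      // Assumes every <book> carries a numeric stock attribute.
      if (Integer.parseInt(attr.getValue("stock")) > 0) inStock++;
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    text.append(ch, start, length);
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    String value = text.toString().trim();
    if ("title".equals(qName)) {
      title = value;                      // remember until the price arrives
    } else if ("price".equals(qName)) {
      total = total.add(new BigDecimal(value));
      System.out.println("parsed: " + title + " @ " + value);
    }
  }

  static CatalogTally tally(String xml) {
    CatalogTally handler = new CatalogTally();
    try {
      SAXParserFactory.newInstance()
          .newSAXParser()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException("could not parse catalog", e);
    }
    return handler;
  }

  public static void main(String[] args) {
    CatalogTally t = tally(CATALOG);
    System.out.println("books seen : " + t.books);
    System.out.println("in stock : " + t.inStock);
    System.out.println("total price : " + t.total.toPlainString());
  }
}
```

BigDecimal is used for the sum so the total keeps its two decimal places without floating-point rounding; a double would also work for these particular values.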


What to take from the run:

  • The three parsed: lines print in document order — Effective Java, Clean Code, Java Concurrency in Practice — proving SAX is a single forward pass: each endElement for price fires exactly once, in the order the books appear, never out of sequence.
  • books seen : 3 comes from incrementing a counter in startElement for every <book> tag. The count lives in your handler, not in any tree — SAX kept no nodes around, only the integer you chose to track.
  • in stock : 2 is read from the stock attribute via attr.getValue("stock"), available only inside startElement. Book b2 has stock="0" and is excluded, so two of the three qualify.
  • total price : 135.50 is the sum of 45.00 + 38.50 + 52.00, accumulated by reading each <price> element's text on its endElement. Pulling text at element end (not in characters) is the safe pattern, since characters can deliver text in multiple chunks.
  • The whole document was fed through a ByteArrayInputStream and consumed once; at no point did the program hold a DOM tree. That is exactly why SAX scales to multi-gigabyte files where DOM would exhaust the heap.

Practice

In a SAX handler, why do you typically accumulate text into a StringBuilder in characters() and read it in endElement(), rather than using the text directly inside characters()?