Java XML SAX Parser
Stream large XML documents in Java with the event-driven SAX parser.
SAX (Simple API for XML) is the JDK's event-driven, streaming XML parsing API. Instead of building a tree in memory the way DOM does, SAX reads the document once from start to finish and pushes events at you ("element started", "text seen", "element ended"), which you handle as they go by. Because the parser never holds the whole document, its own memory use stays roughly constant however large the file is; the only thing that grows is whatever state your handler chooses to keep. The API lives in org.xml.sax, and parsers are created through javax.xml.parsers.SAXParserFactory. Both are part of the standard JDK, so there is nothing to install.
Push parsing vs. building a tree
A DOM parser reads the entire document and hands you a navigable Document object — convenient, but it must fit every node in memory. SAX flips the control: the parser drives, calling methods on your handler as it encounters each piece of markup. You keep only the state you care about. The trade-off is that you cannot move backwards or look ahead — you see each event exactly once, in document order.
| Aspect | SAX | DOM |
|---|---|---|
| Memory | Constant, independent of file size | Proportional to document size |
| Model | Push: parser calls your callbacks | Tree: you walk the loaded document |
| Navigation | Forward only, single pass | Random access, in any direction |
| Modification | Read only | Read and write |
| Best for | Huge files, extracting a subset | Small/medium docs you need to edit |
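The contrast in the table can be made concrete by counting the same elements with both parsers. This is a minimal sketch: the class name SaxVsDom, the helper method names, and the sample document are assumptions made for illustration. The DOM version loads every node before it can answer; the SAX version keeps a single counter.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxVsDom {
    // Hypothetical sample document used only for this comparison.
    static final String XML = "<items><item>a</item><item>b</item><item>c</item></items>";

    // DOM: the whole document becomes a tree, then we query the tree.
    static int countWithDom(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName("item").getLength();
    }

    // SAX: no tree is built; the handler keeps only one integer.
    static int countWithSax(String xml) throws Exception {
        int[] count = {0}; // array so the anonymous class can mutate it
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attr) {
                if ("item".equals(qName)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("DOM count: " + countWithDom(XML));
        System.out.println("SAX count: " + countWithSax(XML));
    }
}
```

Both methods return the same number for the same input; the difference is what each one had to hold in memory to get there.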
The factory and the handler
Two types do almost all the work. SAXParserFactory creates a SAXParser, and you subclass DefaultHandler to receive the events. DefaultHandler implements nearly every callback as a no-op (the exception is fatalError, which throws the exception it is given), so you override only the ones you need:
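import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true); // optional: report namespace URIs and local names
SAXParser parser = factory.newSAXParser();
DefaultHandler handler = new DefaultHandler() {
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attr) {
        System.out.println("start <" + qName + ">"); // fires once per opening tag
    }
};
parser.parse(new File("data.xml"), handler); // throws SAXException or IOException on failure
The core callbacks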
These are the callbacks you override most often. All but the last row belong to ContentHandler; error() and fatalError() belong to ErrorHandler. DefaultHandler implements both interfaces:
| Callback | Fired when |
|---|---|
| startDocument() / endDocument() | The parse begins / ends |
| startElement(uri, localName, qName, attr) | An opening tag is read; attr holds its attributes |
| endElement(uri, localName, qName) | A closing tag is read |
| characters(ch, start, length) | Text content is read, possibly in several chunks |
| error() / fatalError() | A recoverable error (such as a validity violation) / a fatal error (such as malformed markup) is reported |
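To see the order in which these callbacks fire, a handler can simply record each event as it arrives. The following is a minimal sketch; the class name EventTrace and the method trace are assumptions made for illustration. characters() is left out on purpose, because how text is chunked depends on the parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class EventTrace {
    // Parses the given XML and returns the structural events in the order they fired.
    static List<String> trace(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                events.add("startDocument");
            }

            @Override
            public void endDocument() {
                events.add("endDocument");
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attr) {
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        // prints [startDocument, start:a, start:b, end:b, end:a, endDocument]
        System.out.println(trace("<a><b/></a>"));
    }
}
```

Note that an empty element such as <b/> still produces both a start and an end event, and that events nest exactly as the tags do.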
Two facts trip up beginners. First, characters is not guaranteed to deliver all of an element's text in one call — the parser may split it, so you accumulate into a StringBuilder and read it on endElement. Second, attribute values are available only inside startElement, via the Attributes argument:
@Override
public void startElement(String uri, String localName, String qName, Attributes attr) {
    String id = attr.getValue("id"); // by name; null if the attribute is absent
    for (int i = 0; i < attr.getLength(); i++) { // or by index
        System.out.println(attr.getQName(i) + "=" + attr.getValue(i));
    }
}
Tracking state across events
Because SAX gives you no tree, you keep the context. A common pattern is a flag set on startElement and cleared on endElement, plus a text buffer that you reset at each element start and consume at element end:
private final StringBuilder text = new StringBuilder();
@Override public void startElement(String u, String l, String q, Attributes a) {
text.setLength(0); // begin collecting fresh text
}
@Override public void characters(char[] ch, int start, int len) {
text.append(ch, start, len); // text may arrive in pieces
}
@Override public void endElement(String u, String l, String q) {
if (q.equals("title")) System.out.println("title = " + text.toString().trim());
}
A worked example: tallying a catalog without a tree
This program parses a small book catalog held in a text block. The handler counts books, counts how many are in stock (read from a stock attribute), and sums every price, all while the parser streams the document a single time. Nothing but JDK classes is used.
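A minimal sketch of such a program follows. The book titles, the prices, the stock="0" on book b2, and the summary labels are taken from the description here; the ids b1 and b3, the other two stock values, the class and method names, and the exact wording after parsed: are assumptions made for illustration. Text blocks need Java 15 or later.

```java
import java.io.ByteArrayInputStream;
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class CatalogTally {
    // Stock values for b1 and b3 are assumed; only b2's stock="0" is given.
    static final String CATALOG = """
            <catalog>
              <book id="b1" stock="12">
                <title>Effective Java</title>
                <price>45.00</price>
              </book>
              <book id="b2" stock="0">
                <title>Clean Code</title>
                <price>38.50</price>
              </book>
              <book id="b3" stock="4">
                <title>Java Concurrency in Practice</title>
                <price>52.00</price>
              </book>
            </catalog>
            """;

    static class TallyHandler extends DefaultHandler {
        int books;
        int inStock;
        BigDecimal total = BigDecimal.ZERO;
        private final StringBuilder text = new StringBuilder();
        private String title = "";

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attr) {
            text.setLength(0); // start collecting fresh text for this element
            if ("book".equals(qName)) {
                books++;
                String stock = attr.getValue("stock"); // attributes exist only here
                if (stock != null && Integer.parseInt(stock) > 0) {
                    inStock++;
                }
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                title = text.toString().trim();
            } else if ("price".equals(qName)) {
                BigDecimal price = new BigDecimal(text.toString().trim());
                total = total.add(price);
                System.out.println("parsed: " + title + " @ " + price);
            }
        }
    }

    // Streams the XML once through the handler and returns the filled-in tallies.
    static TallyHandler tally(String xml) throws Exception {
        TallyHandler handler = new TallyHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        TallyHandler h = tally(CATALOG);
        System.out.println("books seen : " + h.books);
        System.out.println("in stock : " + h.inStock);
        System.out.println("total price : " + h.total.toPlainString());
    }
}
```

BigDecimal is used for the sum so that the total prints as 135.50 regardless of the default locale; a double with String.format would also work for these particular values.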
What to take from the run:
- The three parsed: lines print in document order (Effective Java, Clean Code, Java Concurrency in Practice), showing that SAX is a single forward pass: each endElement for price fires exactly once, in the order the books appear, never out of sequence.
- books seen : 3 comes from incrementing a counter in startElement for every <book> tag. The count lives in your handler, not in any tree. SAX kept no nodes around, only the integer you chose to track.
- in stock : 2 is read from the stock attribute via attr.getValue("stock"), available only inside startElement. Book b2 has stock="0" and is excluded, so two of the three qualify.
- total price : 135.50 is the sum of 45.00 + 38.50 + 52.00, accumulated by reading each <price> element's text on its endElement. Pulling text at element end (not in characters) is the safe pattern, since characters can deliver text in multiple chunks.
- The whole document was fed through a ByteArrayInputStream and consumed once; at no point did the program hold a DOM tree. That is why SAX scales to multi-gigabyte files where DOM would exhaust the heap.
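The chunking behaviour mentioned in the list can be observed by recording every characters() call separately. This is a minimal sketch; the class name ChunkDemo, the method chunksOf, and the sample element are assumptions made for illustration. How many chunks you get is up to the parser implementation (an entity reference such as &amp;amp; is a common place for a split), so the only thing to rely on is that the chunks joined together equal the element's text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class ChunkDemo {
    // Returns each characters() delivery as its own list entry, in arrival order.
    static List<String> chunksOf(String xml) throws Exception {
        List<String> chunks = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Copy immediately: the parser may reuse the ch array after this call returns.
                chunks.add(new String(ch, start, length));
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return chunks;
    }

    public static void main(String[] args) throws Exception {
        List<String> chunks = chunksOf("<name>Tom &amp; Jerry</name>");
        System.out.println("chunks : " + chunks);          // count varies by parser
        System.out.println("joined : " + String.join("", chunks));
    }
}
```

A handler that acted on each chunk as if it were the full value would see fragments whenever the parser splits the text, which is why the buffer is read in endElement instead.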
Practice
In a SAX handler, why do you typically accumulate text into a StringBuilder in characters() and read it in endElement(), rather than using the text directly inside characters()?