Java XML SAX Parser | W3Docs Learn Java

SAX (Simple API for XML) is the JDK's event-driven, push-style streaming XML parser. Instead of building a tree in memory the way DOM does, SAX reads the document once from start to finish and pushes events at you ("element started", "text seen", "element ended"), which you handle as they fly past. Because it never holds the whole document, the parser's own memory use stays small and roughly constant whatever the file size; the only thing that grows is the state your handler chooses to keep. The API lives in org.xml.sax and parsers are created through javax.xml.parsers.SAXParserFactory, both part of the standard JDK with nothing to install.

This page covers how push parsing differs from building a tree, the factory-and-handler setup, the callbacks you override, how to track state across events, error handling, and a complete runnable example. If you are new to XML in Java, start with the XML introduction; when you need random access or want to edit a document, reach for the DOM parser instead.

Push parsing vs. building a tree

A DOM parser reads the entire document and hands you a navigable Document object — convenient, but it must fit every node in memory. SAX flips the control: the parser drives, calling methods on your handler as it encounters each piece of markup. You keep only the state you care about. The trade-off is that you cannot move backwards or look ahead — you see each event exactly once, in document order.

Aspect	SAX	DOM
Memory	Constant, independent of file size	Proportional to document size
Model	Push: parser calls your callbacks	Tree: you walk the loaded tree
Navigation	Forward only, single pass	Random access, in any direction
Modification	Read only	Read and write
Best for	Huge files, extracting a subset	Small/medium docs you need to edit
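
To make the push model concrete, here is a minimal sketch that records every event in the order the parser delivers it. The class name EventTrace and the helper trace are made up for this illustration; everything else is standard JDK API.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class EventTrace {
  // Hypothetical helper: parses the XML and returns the events in arrival order.
  static List<String> trace(String xml) throws Exception {
    List<String> events = new ArrayList<>();
    DefaultHandler handler = new DefaultHandler() {
      @Override public void startDocument() { events.add("startDocument"); }
      @Override public void startElement(String uri, String localName, String qName, Attributes attr) {
        events.add("start " + qName);
      }
      @Override public void characters(char[] ch, int start, int length) {
        // Whitespace-only chunks are skipped to keep the trace short.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) events.add("text " + text);
      }
      @Override public void endElement(String uri, String localName, String qName) {
        events.add("end " + qName);
      }
      @Override public void endDocument() { events.add("endDocument"); }
    };
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    return events;
  }

  public static void main(String[] args) throws Exception {
    // One line per event, strictly in document order.
    trace("<a><b>hi</b><c/></a>").forEach(System.out::println);
  }
}
```

The list is filled entirely by the parser calling into the handler; the program never asks for "the next node", which is the difference from walking a DOM tree.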

The factory and the handler

Two types do almost all the work. SAXParserFactory creates a SAXParser, and you subclass DefaultHandler to receive the events. DefaultHandler gives every content callback a do-nothing default, so you override only the ones you need:

SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setNamespaceAware(true);          // optional: report namespace URIs
SAXParser parser = factory.newSAXParser();

DefaultHandler handler = new DefaultHandler() {
  @Override
  public void startElement(String uri, String localName, String qName, Attributes attr) {
    System.out.println("start <" + qName + ">");
  }
};
parser.parse(new File("data.xml"), handler);
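
The setNamespaceAware(true) call changes what the uri and localName arguments contain. The sketch below parses the same one-element document both ways so you can compare; the class NamespaceDemo, the helper describeRoot, and the urn:example:books namespace are invented for the example. With the JDK's built-in parser the non-aware run typically reports empty uri and localName values, but treat that as an observation rather than a guarantee.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import java.io.StringReader;

public class NamespaceDemo {
  // Hypothetical helper: returns "uri|localName|qName" for the root element.
  static String describeRoot(String xml, boolean namespaceAware) throws Exception {
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(namespaceAware);
    String[] seen = new String[1]; // one-slot holder the anonymous handler can write to
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName, Attributes attr) {
        if (seen[0] == null) seen[0] = uri + "|" + localName + "|" + qName;
      }
    };
    factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
    return seen[0];
  }

  public static void main(String[] args) throws Exception {
    String xml = "<b:book xmlns:b=\"urn:example:books\"/>";
    System.out.println("aware    : " + describeRoot(xml, true));
    System.out.println("not aware: " + describeRoot(xml, false));
  }
}
```

If your handler matches on qName, as the examples on this page do, it keeps working in both modes; matching on uri plus localName is the sturdier choice once documents use prefixes.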

The core callbacks

These are the callbacks you override most often. The first four rows come from ContentHandler and the last from ErrorHandler; DefaultHandler provides them all:

Callback	Fired when
`startDocument()` / `endDocument()`	The parse begins / ends
`startElement(uri, localName, qName, attr)`	An opening tag is read; `attr` holds its attributes
`endElement(uri, localName, qName)`	A closing tag is read
`characters(ch, start, length)`	Text content is read, possibly in several chunks
`error(e)` / `fatalError(e)`	A validity error is found / the document is not well-formed

Two facts trip up beginners. First, characters is not guaranteed to deliver all of an element's text in one call — the parser may split it, so you accumulate into a StringBuilder and read it on endElement. Second, attribute values are available only inside startElement, via the Attributes argument:

@Override
public void startElement(String uri, String localName, String qName, Attributes attr) {
  String id = attr.getValue("id");           // by name
  for (int i = 0; i < attr.getLength(); i++) // or by index
    System.out.println(attr.getQName(i) + "=" + attr.getValue(i));
}
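
Because the Attributes object is only valid during the startElement call, any value you need later has to be copied out as plain strings. A minimal sketch of that, assuming a catalog of book elements with id and stock attributes like the one used later on this page (the class AttributeCopy and the helper stockById are hypothetical names):

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class AttributeCopy {
  // Hypothetical helper: returns id -> stock for every <book>,
  // using the saved copy when the closing tag arrives.
  static Map<String, String> stockById(String xml) throws Exception {
    Map<String, String> result = new LinkedHashMap<>();
    DefaultHandler handler = new DefaultHandler() {
      private final Map<String, String> bookAttrs = new HashMap<>();

      @Override
      public void startElement(String uri, String localName, String qName, Attributes attr) {
        if (!qName.equals("book")) return;
        bookAttrs.clear();
        for (int i = 0; i < attr.getLength(); i++) {
          bookAttrs.put(attr.getQName(i), attr.getValue(i)); // copy the strings out
        }
      }

      @Override
      public void endElement(String uri, String localName, String qName) {
        // No Attributes argument here, so only the copy is available.
        if (qName.equals("book")) result.put(bookAttrs.get("id"), bookAttrs.get("stock"));
      }
    };
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    return result;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<catalog><book id=\"b1\" stock=\"3\"><title>T</title></book>"
        + "<book id=\"b2\" stock=\"0\"/></catalog>";
    System.out.println(stockById(xml));
  }
}
```

Storing a reference to the Attributes object itself would not be safe, since the parser is free to reuse it for the next element.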

Tracking state across events

Because SAX gives you no tree, you keep the context. A common pattern is a flag set on startElement and cleared on endElement, plus a text buffer that you reset at each element start and consume at element end:

private final StringBuilder text = new StringBuilder();

@Override public void startElement(String u, String l, String q, Attributes a) {
  text.setLength(0);            // begin collecting fresh text
}
@Override public void characters(char[] ch, int start, int len) {
  text.append(ch, start, len);  // text may arrive in pieces
}
@Override public void endElement(String u, String l, String q) {
  if (q.equals("title")) System.out.println("title = " + text.toString().trim());
}
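
A single flag stops being enough when the same tag name appears at different depths. One possible extension, sketched here as an assumption rather than a prescribed pattern, is to keep a stack of open element names and compare the full path. The class PathTracking, the helper bookTitles, and the sample document are invented for the example.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class PathTracking {
  // Hypothetical helper: collects only the titles found at catalog/book/title.
  static List<String> bookTitles(String xml) throws Exception {
    List<String> titles = new ArrayList<>();
    DefaultHandler handler = new DefaultHandler() {
      private final Deque<String> path = new ArrayDeque<>(); // open elements, root first
      private final StringBuilder text = new StringBuilder();

      @Override
      public void startElement(String uri, String localName, String qName, Attributes attr) {
        path.addLast(qName);
        text.setLength(0); // begin collecting fresh text
      }

      @Override
      public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // text may arrive in pieces
      }

      @Override
      public void endElement(String uri, String localName, String qName) {
        if (String.join("/", path).equals("catalog/book/title")) {
          titles.add(text.toString().trim());
        }
        path.removeLast(); // leave the element only after inspecting the path
      }
    };
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    return titles;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<catalog><title>Shelf A</title>"
        + "<book><title>Effective Java</title></book></catalog>";
    System.out.println(bookTitles(xml)); // the catalog-level title is skipped
  }
}
```

The stack holds one short string per open element, so memory grows with nesting depth, not with document length.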

A worked example: tallying a catalog without a tree

This program parses a small book catalog held in a text block. The handler counts books, counts how many are in stock (read from a stock attribute), and sums every price — all while the parser streams the document a single time. Nothing but JDK classes are used.


import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class SaxParserDemo {
  public static void main(String[] args) throws Exception {
    String xml = """
        <?xml version="1.0" encoding="UTF-8"?>
        <catalog>
          <book id="b1" stock="3">
            <title>Effective Java</title>
            <price currency="USD">45.00</price>
          </book>
          <book id="b2" stock="0">
            <title>Clean Code</title>
            <price currency="USD">38.50</price>
          </book>
          <book id="b3" stock="7">
            <title>Java Concurrency in Practice</title>
            <price currency="USD">52.00</price>
          </book>
        </catalog>
        """;

    // A handler collects results as the parser pushes events at it.
    class StockHandler extends DefaultHandler {
      int bookCount = 0;
      int inStock = 0;
      double total = 0.0;
      String currentTitle = "";
      StringBuilder chars = new StringBuilder();
      boolean readingPrice = false;

      @Override
      public void startElement(String uri, String localName, String qName, Attributes attr) {
        chars.setLength(0); // reset text buffer at every element start
        switch (qName) {
          case "book" -> {
            bookCount++;
            if (Integer.parseInt(attr.getValue("stock")) > 0) inStock++;
          }
          case "price" -> readingPrice = true;
        }
      }

      @Override
      public void characters(char[] ch, int start, int length) {
        chars.append(ch, start, length); // text may arrive in several chunks
      }

      @Override
      public void endElement(String uri, String localName, String qName) {
        switch (qName) {
          case "title" -> currentTitle = chars.toString().trim();
          case "price" -> {
            if (readingPrice) total += Double.parseDouble(chars.toString().trim());
            readingPrice = false;
            System.out.println("parsed: " + currentTitle);
          }
        }
      }
    }

    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser parser = factory.newSAXParser();
    StockHandler handler = new StockHandler();
    parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);

    System.out.println("books seen   : " + handler.bookCount);
    System.out.println("in stock     : " + handler.inStock);
    System.out.printf("total price  : %.2f%n", handler.total);
  }
}

What to take from the run:

The three parsed: lines print in document order — Effective Java, Clean Code, Java Concurrency in Practice — proving SAX is a single forward pass: each endElement for price fires exactly once, in the order the books appear, never out of sequence.
books seen : 3 comes from incrementing a counter in startElement for every <book> tag. The count lives in your handler, not in any tree — SAX kept no nodes around, only the integer you chose to track.
in stock : 2 is read from the stock attribute via attr.getValue("stock"), available only inside startElement. Book b2 has stock="0" and is excluded, so two of the three qualify.
total price : 135.50 is the sum of 45.00 + 38.50 + 52.00, accumulated by reading each <price> element's text on its endElement. Pulling text at element end (not in characters) is the safe pattern, since characters can deliver text in multiple chunks.
The whole document was fed through a ByteArrayInputStream and consumed once; at no point did the program hold a DOM tree. That is exactly why SAX scales to multi-gigabyte files where DOM would exhaust the heap.

Handling malformed XML

SAX reports problems through three ErrorHandler callbacks, all overridable on DefaultHandler:

Callback	Meaning	Parsing continues?
`warning(SAXParseException e)`	Minor issue (e.g. a recoverable DTD warning)	Yes
`error(SAXParseException e)`	A validity error against a DTD/schema	Yes, unless you re-throw
`fatalError(SAXParseException e)`	Well-formedness violation (broken markup)	No — the parse stops

By default parse() throws a SAXParseException on a fatal error, so wrapping the call in a try/catch is enough for most code. The exception carries getLineNumber() and getColumnNumber(), which makes it easy to point at the offending markup:

try {
  parser.parse(new File("data.xml"), handler);
} catch (SAXParseException e) {
  System.err.println("bad XML at line " + e.getLineNumber()
      + ", column " + e.getColumnNumber() + ": " + e.getMessage());
}
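
To see the exception in action without a file on disk, the following sketch feeds a deliberately broken string to the parser and reports where it failed. The class MalformedDemo and the helper firstErrorLine are hypothetical names, and the -1 return value for a clean parse is a convention chosen for this example.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;
import java.io.StringReader;

public class MalformedDemo {
  // Hypothetical helper: returns the line of the first fatal error,
  // or -1 when the document parses cleanly.
  static int firstErrorLine(String xml) throws Exception {
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
      return -1;
    } catch (SAXParseException e) {
      // DefaultHandler.fatalError rethrows, so the error surfaces here.
      System.err.println("bad XML at line " + e.getLineNumber()
          + ", column " + e.getColumnNumber() + ": " + e.getMessage());
      return e.getLineNumber();
    }
  }

  public static void main(String[] args) throws Exception {
    String broken = "<catalog>\n  <book>\n</catalog>"; // <book> is never closed
    System.out.println("broken -> " + firstErrorLine(broken));
    System.out.println("clean  -> " + firstErrorLine("<catalog><book/></catalog>"));
  }
}
```

Any events delivered before the error have already reached your handler, so treat results gathered from a failed parse as partial.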

Warning

If your handler throws an unchecked exception (for example a NumberFormatException from parsing an attribute), it propagates straight out of parse() and aborts the parse. A document can be well-formed and still carry a missing or non-numeric attribute, so validate or guard attribute values inside the callback rather than assuming the input has the shape you expect.
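
One way to guard an attribute is a small helper that falls back to a default when the value is missing or not a number. This is a sketch under that assumption; the class SafeAttributes and the method intAttr are made up, and a stricter handler might prefer to throw a SAXException instead of substituting a default.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

public class SafeAttributes {
  // Hypothetical helper: reads an integer attribute, returning the fallback
  // when the attribute is absent or cannot be parsed.
  static int intAttr(Attributes attr, String name, int fallback) {
    String raw = attr.getValue(name);
    if (raw == null) return fallback; // attribute not present
    try {
      return Integer.parseInt(raw.trim());
    } catch (NumberFormatException e) {
      return fallback; // present but not a valid integer
    }
  }

  public static void main(String[] args) {
    // AttributesImpl lets us build an Attributes object without running a parser.
    AttributesImpl attr = new AttributesImpl();
    attr.addAttribute("", "stock", "stock", "CDATA", "7");
    attr.addAttribute("", "shelf", "shelf", "CDATA", "top");

    System.out.println(intAttr(attr, "stock", 0));   // valid number
    System.out.println(intAttr(attr, "shelf", 0));   // not a number, falls back
    System.out.println(intAttr(attr, "missing", 0)); // absent, falls back
  }
}
```

Inside a handler you would call it from startElement, for example intAttr(attr, "stock", 0) > 0 in place of a bare Integer.parseInt.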

Practice

In a SAX handler, why do you typically accumulate text into a StringBuilder in characters() and read it in endElement(), rather than using the text directly inside characters()?

Because the parser may report an element's text content across several separate characters() calls, so only at endElement() is the full text guaranteed to be assembled
Because characters() cannot access String objects, only raw char arrays that must be converted later
Because endElement() runs on a different thread and needs the text handed off through a buffer
Because SAX forbids reading attribute values until the closing tag has been seen