SAX: A Simple Event-Base XML Processor

SAX is an API used to parse XML documents. It is based on events generated while reading through the document. Callback methods receive those events. A custom handler contains those callback methods.

The API is efficient because it drops events right after the callbacks received them. Therefore, SAX has efficient memory management, unlike DOM, for example.

(DOM stands for Document Object Model. The DOM parser does not rely on events. Moreover, it loads the whole XML document into memory to parse it. SAX is more memory-efficient than DOM.)

The Java implementation of SAX (org.xml.sax) is considered to be normative since there is no formal specification.

A SAX parser only needs to report each parsing event as it happens, and normally discards almost all of that information once reported (it does, however, keep some things, for example a list of all elements that have not been closed yet, in order to catch later errors such as end-tags in the wrong order). Thus, the minimum memory required for a SAX parser is proportional to the maximum depth of the XML file (i.e., of the XML tree) and the maximum data involved in a single XML event (such as the name and attributes of a single start-tag, or the content of a processing instruction, etc.).

XML processing with SAX

A parser that implements SAX (i.e., a SAX Parser) functions as a stream parser, with an event-driven API. The user defines a number of callback methods that will be called when events occur during parsing. The SAX events include (among others):

Some events correspond to XML objects that are easily returned all at once, such as comments. However, XML elements can contain many other XML objects, and so SAX represents them as does XML itself: by one event at the beginning, and another at the end. Properly speaking, the SAX interface does not deal in elements, but in events that largely correspond to tags. SAX parsing is unidirectional; previously parsed data cannot be re-read without starting the parsing operation again.

There are many SAX-like implementations in existence. In practice, details vary, but the overall model is the same. For example, XML attributes are typically provided as name and value arguments passed to element events, but can also be provided as separate events, or via a hash table or similar collection of all the attributes. For another, some implementations provide Init and Fin callbacks for the very start and end of parsing; others do not. The exact names for given event types also vary slightly between implementations.

(Some SAX implementations provide a separate event just for the XML declaration).

Expat

Expat is a stream-oriented XML 1.0 parser library, written in C, more precisely C99. As one of the first available open-source XML parsers, Expat has found a place in many open-source projects. Such projects include the Apache HTTP Server, Mozilla, Perl, Python and PHP. It is also bound in many other languages.

GitHub hosts the Expat project. Versions exist for most major operating-systems.

Timeline

Software developer James Clark released version 1.0 in 1998 while serving as technical lead on the XML Working Group at the World Wide Web Consortium. Clark released two more versions, 1.1 and 1.2, before turning the project over to a group led by Clark Cooper and Fred Drake in 2000. The new group released version 1.95.0 in September 2000 and continues to release new versions to incorporate bug fixes and enhancements.

Versions up to 2.5.0 have a Score 7.5 (High) DoS vulnerability CVE-2023-52425.

Deployment

To use the Expat library, programs first register handler functions with Expat. When Expat parses an XML document, it calls the registered handlers as it finds relevant tokens in the input stream. These tokens and their associated handler calls are called events. Typically, programs register handler functions for XML element start or stop events and character events. Expat provides facilities for more sophisticated event handling such as XML Namespace declarations, processing instructions and DTD events.

Expat's parsing events resemble the events defined in the Simple API for XML (SAX), but Expat is not a SAX-compliant parser. Projects incorporating the Expat library often build SAX and possibly DOM parsers on top of Expat. While Expat is mainly a stream-based (push) parser, it supports stopping and restarting parsing at arbitrary times, thus making the implementation of a pull parser relatively easy as well.

SAX vs StAX

StAX is more recent than SAX and DOM. It stands for Streaming API for XML.

The main difference with SAX is that StAX uses a pull mechanism instead of SAX’s push mechanism (using callbacks). This means the control is given to the client to decide when the events need to be pulled. Therefore, there is no obligation to pull the whole document if only a part of it is needed.

Like SAX, it provides an easy API to work with XML with a memory-efficient way of parsing.

Unlike SAX, it doesn’t provide schema validation as one of its features.