(C/C++) Libraries for Processing XML
RapidXml*
RapidXml is an attempt to create the fastest XML parser possible, while retaining usability, portability and reasonable W3C compatibility. It is an in-situ parser written in modern C++, with parsing speed approaching that of the strlen function executed on the same data.
RapidXml has been around since 2006, and is being used by lots of people. HTC uses it in some of its mobile phones.
If you are looking for a stable and fast parser, look no further. Integration with your project will be trivial, because the entire library is contained in a single header file and requires no building or configuration.
Current version is RapidXml 1.13. Also available is its online Manual with a full and detailed reference. You may also like to check Boost.PropertyTree library, which presents a higher level interface, and uses RapidXml as its default XML parser.
The author of RapidXml is Marcin Kalicinski.
Boost.PropertyTree
(From https://www.boost.org/doc/libs/latest/doc/html/property_tree.html)
The Property Tree library provides a data structure that stores an arbitrarily deeply nested tree of values, indexed at each level by some key. Each node of the tree stores its own value, plus an ordered list of its subnodes and their keys. The tree allows easy access to any of its nodes by means of a path, which is a concatenation of multiple keys.
In addition, the library provides parsers and generators for a number of data formats that can be represented by such a tree, including XML, INI, and JSON.
Property trees are versatile data structures, but are particularly suited for holding configuration data. The tree provides its own, tree-specific interface, and each node is also an STL-compatible Sequence for its child nodes.
Conceptually, then, a node can be thought of as the following structure:
struct ptree
{
    data_type data;                          // data associated with the node
    list< pair<key_type, ptree> > children;  // ordered list of named children
};
Both key_type and data_type are configurable to some extent, but will usually be std::string or std::wstring, and the parsers only work with this kind of tree.
Many software projects develop a similar tool at some point of their lifetime, and property tree originated the same way. We hope the library can save many from reinventing the wheel.
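The conceptual structure and path-based access described above can be sketched with the standard library alone. This is a minimal stand-in, not Boost's implementation: the names MiniTree, add and find are hypothetical, and the dot-separated path syntax is an assumption made for the illustration.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <utility>

// Hypothetical stand-in for the conceptual ptree structure; NOT Boost's code.
struct MiniTree {
    std::string data;                                      // value stored at this node
    std::list<std::pair<std::string, MiniTree>> children;  // ordered, named children

    // Append a child under `key` and return a reference to it.
    MiniTree& add(const std::string& key, const std::string& value = "") {
        children.emplace_back(key, MiniTree{value, {}});
        return children.back().second;
    }

    // Resolve a dot-separated path such as "debug.level". At each level the
    // first child with a matching key wins. Returns nullptr if a key is missing.
    const MiniTree* find(const std::string& path) const {
        const MiniTree* node = this;
        std::size_t start = 0;
        while (true) {
            const std::size_t dot = path.find('.', start);
            const std::string key = path.substr(
                start, dot == std::string::npos ? std::string::npos : dot - start);
            const MiniTree* next = nullptr;
            for (const auto& child : node->children) {
                if (child.first == key) { next = &child.second; break; }
            }
            if (!next) return nullptr;
            node = next;
            if (dot == std::string::npos) return node;
            start = dot + 1;
        }
    }
};
```

A tree built with root.add("debug").add("level", "2") can then be queried with root.find("debug.level"). The real library offers a much richer interface (typed get/put, parsers, iteration), so this only illustrates the data layout.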
pugixml
pugixml is a light-weight C++ XML processing library. It features:
- DOM-like interface with rich traversal/modification capabilities
- Extremely fast non-validating XML parser which constructs the DOM tree from an XML file/buffer
- XPath 1.0 implementation for complex data-driven tree queries
- Full Unicode support with Unicode interface variants and automatic encoding conversions
The library is extremely portable and easy to integrate and use. You can download it in an archive (Windows/Unix line endings), get it from Git/Subversion repository, install it as a package in one of the major Linux/BSD distributions (Ubuntu, Debian, Fedora, Gentoo, Arch Linux, FreeBSD and more), install it as a package in one of the OSX package managers (Homebrew, MacPorts), install a NuGet package or use one of the alternative package managers (Conda).
pugixml has been developed and maintained since 2006 and has many users. All code is distributed under the MIT license, making it completely free to use in both open-source and proprietary applications.
Document object model
pugixml stores XML data in DOM-like way: the entire XML document (both document structure and element data) is stored in memory as a tree. The tree can be loaded from a character stream (file, string, C++ I/O stream), then traversed with the special API or XPath expressions. The whole tree is mutable: both node structure and node/attribute data can be changed at any time. Finally, the result of document transformations can be saved to a character stream (file, C++ I/O stream or custom transport).
Tree structure
The XML document is represented with a tree data structure. The root of the tree is the document itself, which corresponds to C++ type xml_document. Document has one or more child nodes, which correspond to C++ type xml_node. Nodes have different types; depending on a type, a node can have a collection of child nodes, a collection of attributes, which correspond to C++ type xml_attribute, and some additional data (i.e. name).
The tree nodes can be of one of the following types (which together form the enumeration xml_node_type):
- Document node (node_document) - this is the root of the tree, which consists of several child nodes. This node corresponds to xml_document class; note that xml_document is a sub-class of xml_node, so the entire node interface is also available. However, document node is special in several ways, which are covered below. There can be only one document node in the tree; document node does not have any XML representation. Document generally has one child element node (see document_element()), although documents parsed from XML fragments (see parse_fragment) can have more than one.
- Element/tag node (node_element) - this is the most common type of node, which represents XML elements. Element nodes have a name, a collection of attributes and a collection of child nodes (both of which may be empty). The attribute is a simple name/value pair. The example XML representation of element nodes is as follows:
  <node attr="value"><child/></node>
  There are two element nodes here: one has name "node", single attribute "attr" and single child "child", another has name "child" and does not have any attributes or child nodes.
- Plain character data nodes (node_pcdata) represent plain text in XML. PCDATA nodes have a value, but do not have a name or children/attributes. Note that plain character data is not a part of the element node but instead has its own node; an element node can have several child PCDATA nodes. The example XML representation of text nodes is as follows:
  <node> text1 <child/> text2 </node>
  Here "node" element has three children, two of which are PCDATA nodes with values " text1 " and " text2 ".
- Character data nodes (node_cdata) represent text in XML that is quoted in a special way. CDATA nodes do not differ from PCDATA nodes except in XML representation - the above text example looks like this with CDATA:
  <node> <![CDATA[text1]]> <child/> <![CDATA[text2]]> </node>
  CDATA nodes make it easy to include non-escaped <, & and > characters in plain text. CDATA value can not contain the character sequence ]]>, since it is used to determine the end of node contents.
- Comment nodes (node_comment) represent comments in XML. Comment nodes have a value, but do not have a name or children/attributes. The example XML representation of a comment node is as follows:
  <!-- comment text -->
  Here the comment node has value "comment text". By default comment nodes are treated as non-essential part of XML markup and are not loaded during XML parsing. You can override this behavior with parse_comments flag.
- Processing instruction nodes (node_pi) represent processing instructions (PI) in XML. PI nodes have a name and an optional value, but do not have children/attributes. The example XML representation of a PI node is as follows:
  <?name value?>
  Here the name (also called PI target) is "name", and the value is "value". By default PI nodes are treated as non-essential part of XML markup and are not loaded during XML parsing. You can override this behavior with parse_pi flag.
- Declaration node (node_declaration) represents document declarations in XML. Declaration nodes have a name ("xml") and an optional collection of attributes, but do not have value or children. There can be only one declaration node in a document; moreover, it should be the topmost node (its parent should be the document). The example XML representation of a declaration node is as follows:
  <?xml version="1.0"?>
  Here the node has name "xml" and a single attribute with name "version" and value "1.0". By default declaration nodes are treated as non-essential part of XML markup and are not loaded during XML parsing. You can override this behavior with parse_declaration flag. Also, by default a dummy declaration is output when XML document is saved unless there is already a declaration in the document; you can disable this with format_no_declaration flag.
- Document type declaration node (node_doctype) represents document type declarations in XML. Document type declaration nodes have a value, which corresponds to the entire document type contents; no additional nodes are created for inner elements like <!ENTITY>. There can be only one document type declaration node in a document; moreover, it should be the topmost node (its parent should be the document). The example XML representation of a document type declaration node is as follows:
  <!DOCTYPE greeting [ <!ELEMENT greeting (#PCDATA)> ]>
  Here the node has value "greeting [ <!ELEMENT greeting (#PCDATA)> ]". By default document type declaration nodes are treated as non-essential part of XML markup and are not loaded during XML parsing. You can override this behavior with parse_doctype flag.
C++ interface
Despite the fact that there are several node types, there are only three C++ classes representing the tree (xml_document, xml_node, xml_attribute); some operations on xml_node are only valid for certain node types. The classes are described below.
xml_document is the owner of the entire document structure; it is a non-copyable class. The interface of xml_document consists of loading functions (see Loading document), saving functions (see Saving document) and the entire interface of xml_node, which allows for document inspection and/or modification. Note that while xml_document is a sub-class of xml_node, xml_node is not a polymorphic type; the inheritance is present only to simplify usage. Alternatively you can use the document_element function to get the element node that's the immediate child of the document.
Default constructor of xml_document initializes the document to the tree with only a root node (document node). You can then populate it with data using either tree modification functions or loading functions; all loading functions destroy the previous tree with all occupied memory, which puts existing node/attribute handles for this document to invalid state. If you want to destroy the previous tree, you can use the xml_document::reset function; it destroys the tree and replaces it with either an empty one or a copy of the specified document. Destructor of xml_document also destroys the tree, thus the lifetime of the document object should exceed the lifetimes of any node/attribute handles that point to the tree.
xml_node is the handle to document node; it can point to any node in the document, including the document node itself. There is a common interface for nodes of all types; the actual node type can be queried via the xml_node::type() method. Note that xml_node is only a handle to the actual node, not the node itself - you can have several xml_node handles pointing to the same underlying object. Destroying a xml_node handle does not destroy the node and does not remove it from the tree. The size of xml_node is equal to that of a pointer, so it is nothing more than a lightweight wrapper around a pointer; you can safely pass or return xml_node objects by value without additional overhead.
There is a special value of xml_node type, known as null node or empty node (such nodes have type node_null). It does not correspond to any node in any document, and thus resembles a null pointer. However, all operations are defined on empty nodes; generally the operations don't do anything and return empty nodes/attributes or empty strings as their result (see documentation for specific functions for more detailed information). This is useful for chaining calls; for example, you can get the grandparent of a node like so: node.parent().parent(); if a node is a null node or it does not have a parent, the first parent() call returns null node; the second parent() call then also returns null node, which makes error handling easier.
xml_attribute is the handle to an XML attribute; it has the same semantics as xml_node, i.e. there can be several xml_attribute handles pointing to the same underlying object and there is a special null attribute value, which propagates to function results.
Both xml_node and xml_attribute have the default constructor which initializes them to null objects.
xml_node and xml_attribute try to behave like pointers, that is, they can be compared with other objects of the same type, making it possible to use them as keys in associative containers. All handles to the same underlying object are equal, and any two handles to different underlying objects are not equal. Null handles only compare as equal to null handles. The result of relational comparison can not be reliably determined from the order of nodes in file or in any other way. Do not use relational comparison operators except for search optimization (i.e. associative container keys).
If you want to use xml_node or xml_attribute objects as keys in hash-based associative containers, you can use the hash_value member functions. They return the hash values that are guaranteed to be the same for all handles to the same underlying object. The hash value for null handles is 0. Note that hash value does not depend on the content of the node, only on the location of the underlying structure in memory - this means that loading the same document twice will likely produce different hash values, and copying the node will not preserve the hash.
Finally handles can be implicitly cast to boolean-like objects, so that you can test if the node/attribute is empty with the following code: if (node) { … } or if (!node) { … } else { … }. Alternatively you can check if a given xml_node/xml_attribute handle is null by calling the following methods:
bool xml_attribute::empty() const;
bool xml_node::empty() const;
Nodes and attributes do not exist without a document tree, so you can't create them without adding them to some document. Once underlying node/attribute objects are destroyed, the handles to those objects become invalid. While this means that destruction of the entire tree invalidates all node/attribute handles, it also means that destroying a subtree (by calling xml_node::remove_child) or removing an attribute invalidates the corresponding handles. There is no way to check handle validity; you have to ensure correctness through external mechanisms.
Unicode interface
There are two choices of interface and internal representation when configuring pugixml: you can either choose the UTF-8 (also called char) interface or UTF-16/32 (also called wchar_t) one. The choice is controlled via PUGIXML_WCHAR_MODE define; you can set it via pugiconfig.hpp or via preprocessor options, as discussed in Additional configuration options. If this define is set, the wchar_t interface is used; otherwise (by default) the char interface is used. The exact wide character encoding is assumed to be either UTF-16 or UTF-32 and is determined based on the size of wchar_t type.
All tree functions that work with strings work with either C-style null terminated strings or STL strings of the selected character type. For example, node name accessors look like this in char mode:
const char* xml_node::name() const;
bool xml_node::set_name(const char* value);
and like this in wchar_t mode:
const wchar_t* xml_node::name() const;
bool xml_node::set_name(const wchar_t* value);
There is a special type, pugi::char_t, that is defined as the character type and depends on the library configuration; it will be also used in the documentation hereafter. There is also a type pugi::string_t, which is defined as the matching STL string of the character type; it corresponds to std::string in char mode and to std::wstring in wchar_t mode. Similarly, string_view_t is defined to be std::basic_string_view<char_t>. Overloads for string_view_t are only available when building for C++17 or later (see PUGIXML_HAS_STRING_VIEW).
In addition to the interface, the internal implementation changes to store XML data as pugi::char_t; this means that these two modes have different memory usage characteristics - generally UTF-8 mode is more memory and performance efficient, especially if sizeof(wchar_t) is 4. The conversion to pugi::char_t upon document loading and from pugi::char_t upon document saving happen automatically, which also carries a minor performance penalty. The general advice however is to select the character mode based on usage scenario, i.e. if UTF-8 is inconvenient to process and most of your XML data is non-ASCII, wchar_t mode is probably a better choice.
There are cases when you'll have to convert string data between UTF-8 and wchar_t encodings; the following helper functions are provided for such purposes:
std::string as_utf8(const wchar_t* str);
std::wstring as_wide(const char* str);
Both functions accept a null-terminated string as an argument str, and return the converted string. as_utf8 performs conversion from UTF-16/32 to UTF-8; as_wide performs conversion from UTF-8 to UTF-16/32. Invalid UTF sequences are silently discarded upon conversion. str has to be a valid string; passing null pointer results in undefined behavior. There are also two overloads with the same semantics which accept a string as an argument:
std::string as_utf8(const std::wstring& str);
std::wstring as_wide(const std::string& str);
Thread-safety guarantees*
Almost all functions in pugixml have the following thread-safety guarantees:
- it is safe to call free (non-member) functions from multiple threads
- it is safe to perform concurrent read-only accesses to the same tree (all constant member functions do not modify the tree)
- it is safe to perform concurrent read/write accesses on multiple trees, as long as each tree is only accessed from a single thread at a time
Concurrent read/write access to a single tree requires synchronization, for example via a reader-writer lock. Modification includes altering document structure and altering individual node/attribute data, i.e. changing names/values.
The only exception is set_memory_management_functions; it modifies global variables and as such is not thread-safe. Its usage policy has more restrictions, see Custom memory allocation/deallocation functions.
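The reader-writer lock mentioned above could be arranged roughly as follows. This is a generic sketch that is not part of pugixml: GuardedTree is a hypothetical wrapper, it assumes C++17 (std::shared_mutex), and Tree stands for any default-constructible tree type such as a document object.

```cpp
#include <mutex>
#include <shared_mutex>
#include <utility>

// Hypothetical wrapper: guards one tree with a reader-writer lock so that
// read-only accesses may overlap while modifications are exclusive.
template <typename Tree>
class GuardedTree {
public:
    // Run fn with shared access; fn receives a const reference to the tree.
    template <typename Fn>
    auto read(Fn&& fn) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return std::forward<Fn>(fn)(static_cast<const Tree&>(tree_));
    }

    // Run fn with exclusive access; fn may modify the tree.
    template <typename Fn>
    auto write(Fn&& fn) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        return std::forward<Fn>(fn)(tree_);
    }

private:
    mutable std::shared_mutex mutex_;
    Tree tree_;
};
```

Handles or pointers into the tree should not escape the callbacks, since they would then be used outside the lock.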
Exception guarantees
With the exception of XPath, pugixml itself does not throw any exceptions. Additionally, most pugixml functions have a no-throw exception guarantee.
This is not applicable to functions that operate on STL strings or IOstreams; such functions have either strong guarantee (functions that operate on strings) or basic guarantee (functions that operate on streams). Also functions that call user-defined callbacks (i.e. xml_node::traverse or xml_node::find_node) do not provide any exception guarantees beyond the ones provided by the callback.
If exception handling is not disabled with PUGIXML_NO_EXCEPTIONS define, XPath functions may throw xpath_exception on parsing errors; also, XPath functions may throw std::bad_alloc in low memory conditions. Still, XPath functions provide strong exception guarantee.
Memory Management
pugixml requests the memory needed for document storage in big chunks, and allocates document data inside those chunks. This section discusses replacing functions used for chunk allocation and internal memory management implementation.
[...]
Loading Documents*
pugixml provides several functions for loading XML data from various places - files, C++ iostreams, memory buffers. All functions use an extremely fast non-validating parser. This parser is not fully W3C conformant - it can load any valid XML document, but does not perform some well-formedness checks. While considerable effort is made to reject invalid XML documents, some validation is not performed for performance reasons. Also some XML transformations (i.e. EOL handling or attribute value normalization) can impact parsing speed and thus can be disabled. However for vast majority of XML documents there is no performance difference between different parsing options. Parsing options also control whether certain XML nodes are parsed; see Parsing Options for more information.
XML data is always converted to internal character format (see Unicode interface) before parsing. pugixml supports all popular Unicode encodings (UTF-8, UTF-16 (big and little endian), UTF-32 (big and little endian); UCS-2 is naturally supported since it's a strict subset of UTF-16) as well as some non-Unicode encodings (Latin-1) and handles all encoding conversions automatically. Unless explicit encoding is specified, loading functions perform automatic encoding detection based on source XML data, so in most cases you do not have to specify document encoding. Encoding conversion is described in more detail in Encodings.
Loading document from file*
The most common source of XML data is files; pugixml provides dedicated functions for loading an XML document from file:
xml_parse_result xml_document::load_file(const char* path, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
xml_parse_result xml_document::load_file(const wchar_t* path, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
These functions accept the file path as their first argument, and also two optional arguments, which specify parsing options (see Parsing options) and input data encoding (see Encodings). The path has the target operating system format, so it can be a relative or absolute one, it should have the delimiters of the target system, it should have the exact case if the target file system is case-sensitive, etc.
File path is passed to the system file opening function as is in case of the first function (which accepts const char* path); the second function either uses a special file opening function if it is provided by the runtime library or converts the path to UTF-8 and uses the system file opening function.
load_file destroys the existing document tree and then tries to load the new tree from the specified file. The result of the operation is returned in an xml_parse_result object; this object contains the operation status and the related information (i.e. last successfully parsed position in the input file, if parsing fails). See Handling parsing errors for error handling details.
This is an example of loading XML document from file (samples/load_file.cpp):
pugi::xml_document doc;
pugi::xml_parse_result result = doc.load_file("tree.xml");
std::cout << "Load result: " << result.description() << ", mesh name: " << doc.child("mesh").attribute("name").value() << std::endl;
Loading document from memory*
Loading document from C++ IOstreams*
Handling parsing errors*
All document loading functions return the parsing result via xml_parse_result object. It contains parsing status, the offset of last successfully parsed character from the beginning of the source stream, and the encoding of the source stream:
struct xml_parse_result
{
    xml_parse_status status;
    ptrdiff_t offset;
    xml_encoding encoding;
    operator bool() const;
    const char* description() const;
};
Parsing status is represented as the xml_parse_status enumeration and can be one of the following:
- status_ok means that no error was encountered during parsing; the source stream represents the valid XML document which was fully parsed and converted to a tree.
- status_file_not_found is only returned by load_file function and means that file could not be opened.
- status_io_error is returned by load_file function and by load functions with std::istream/std::wstream arguments; it means that some I/O error has occurred during reading the file/stream.
- status_out_of_memory means that there was not enough memory during some allocation; any allocation failure during parsing results in this error.
- status_internal_error means that something went horribly wrong; currently this error does not occur.
- status_unrecognized_tag means that parsing stopped due to a tag with either an empty name or a name which starts with an illegal character, such as #.
- status_bad_pi means that parsing stopped due to incorrect document declaration/processing instruction.
- status_bad_comment, status_bad_cdata, status_bad_doctype and status_bad_pcdata mean that parsing stopped due to the invalid construct of the respective type.
- status_bad_start_element means that parsing stopped because starting tag either had no closing > symbol or contained some incorrect symbol.
- status_bad_attribute means that parsing stopped because there was an incorrect attribute, such as an attribute without value or with value that is not quoted (note that <node attr=1> is incorrect in XML).
- status_bad_end_element means that parsing stopped because ending tag had incorrect syntax (e.g. extra non-whitespace symbols between tag name and >).
- status_end_element_mismatch means that parsing stopped because the closing tag did not match the opening one (e.g. <node></nedo>) or because some tag was not closed at all.
- status_no_document_element means that no element nodes were discovered during parsing; this usually indicates an empty or invalid document.
description() member function can be used to convert parsing status to a string; the returned message is always in English, so you'll have to write your own function if you need a localized string. However please note that the exact messages returned by description() function may change from version to version, so any complex status handling should be based on status value. Note that description() returns a char string even in PUGIXML_WCHAR_MODE; you'll have to call as_wide to get the wchar_t string.
If parsing failed because the source data was not a valid XML, the resulting tree is not destroyed - despite the fact that load function returns error, you can use the part of the tree that was successfully parsed. Obviously, the last element may have an unexpected name/value; for example, if the attribute value does not end with the necessary quotation mark, as in <node attr="value>some data</node> example, the value of attribute attr will contain the string value>some data</node>.
In addition to the status code, parsing result has an offset member, which contains the offset of last successfully parsed character if parsing failed because of an error in source data; otherwise offset is 0. For parsing efficiency reasons, pugixml does not track the current line during parsing; this offset is in units of pugi::char_t (bytes for character mode, wide characters for wide character mode). Many text editors support 'Go To Position' feature - you can use it to locate the exact error position. Alternatively, if you're loading the document from memory, you can display the error chunk along with the error description (see the example code below).
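Since the parser reports only an offset, a line and column can be recovered afterwards by scanning the source buffer. The helper below is a sketch and not part of pugixml: offset_to_line_col is a hypothetical name, and it assumes the char interface with a UTF-8 buffer that is the same text that was parsed, so that the offset indexes into it directly. Columns are counted in bytes, not in Unicode characters.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: converts a 0-based offset into `text` to a 1-based
// (line, column) pair by counting '\n' characters before the offset.
// Negative offsets are treated as 0; offsets past the end are clamped.
std::pair<std::size_t, std::size_t> offset_to_line_col(const std::string& text,
                                                       std::ptrdiff_t offset) {
    std::size_t end = offset < 0 ? 0 : static_cast<std::size_t>(offset);
    if (end > text.size()) end = text.size();

    std::size_t line = 1, column = 1;
    for (std::size_t i = 0; i < end; ++i) {
        if (text[i] == '\n') {
            ++line;      // a newline starts the next line...
            column = 1;  // ...at its first column
        } else {
            ++column;
        }
    }
    return {line, column};
}
```

With a failed parse result, offset_to_line_col(buffer, result.offset) then gives a position that can be shown next to result.description().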
Parsing result also has an encoding member, which can be used to check that the source data encoding was correctly guessed. It is equal to the exact encoding used during parsing (i.e. with the exact endianness); see Encodings for more information.
Parsing result object can be implicitly converted to bool; if you do not want to handle parsing errors thoroughly, you can just check the return value of load functions as if it was a bool: if (doc.load_file("file.xml")) { … } else { … }.
This is an example of handling loading errors (samples/load_error_handling.cpp):
// source is assumed to be a null-terminated XML string (const char*)
pugi::xml_document doc;
pugi::xml_parse_result result = doc.load_string(source);

if (result)
{
    std::cout << "XML [" << source << "] parsed without errors, "
              << "attr value: [" << doc.child("node").attribute("attr").value() << "]\n\n";
}
else
{
    std::cout << "XML [" << source << "] parsed with errors, "
              << "attr value: [" << doc.child("node").attribute("attr").value() << "]\n";
    std::cout << "Error description: " << result.description() << "\n";
    // result.offset indexes into source, so this prints the text at the error position
    std::cout << "Error offset: " << result.offset
              << " (error at [..." << (source + result.offset) << "]\n\n";
}
Parsing Options
All document loading functions accept the optional parameter options. This is a bitmask that customizes the parsing process: you can select the node types that are parsed and various transformations that are performed with the XML text. Disabling certain transformations can improve parsing performance for some documents; however, the code for all transformations is very well optimized, and thus the majority of documents won't get any performance benefit. As a rule of thumb, only modify parsing flags if you want to get some nodes in the document that are excluded by default (i.e. declaration or comment nodes).
These flags control the resulting tree contents:
parse_declaration determines if XML document declaration (node with type node_declaration) is to be put in DOM tree. If this flag is off, it is not put in the tree, but is still parsed and checked for correctness. This flag is off by default.
parse_doctype determines if XML document type declaration (node with type node_doctype) is to be put in DOM tree. If this flag is off, it is not put in the tree, but is still parsed and checked for correctness. This flag is off by default.
parse_pi determines if processing instructions (nodes with type node_pi) are to be put in DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. Note that <?xml …?> (document declaration) is not considered to be a PI. This flag is off by default.
parse_comments determines if comments (nodes with type node_comment) are to be put in DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. This flag is off by default.
parse_cdata determines if CDATA sections (nodes with type node_cdata) are to be put in DOM tree. If this flag is off, they are not put in the tree, but are still parsed and checked for correctness. This flag is on by default.
parse_trim_pcdata determines if leading and trailing whitespace characters are to be removed from PCDATA nodes. While for some applications leading/trailing whitespace is significant, often the application only cares about the non-whitespace contents so it's easier to trim whitespace from text during parsing. This flag is off by default.
parse_ws_pcdata determines if PCDATA nodes (nodes with type node_pcdata) that consist only of whitespace characters are to be put in DOM tree. Often whitespace-only data is not significant for the application, and the cost of allocating and storing such nodes (both memory and speed-wise) can be significant. For example, after parsing XML string <node> <a/> </node>, <node> element will have three children when parse_ws_pcdata is set (child with type node_pcdata and value " ", child with type node_element and name "a", and another child with type node_pcdata and value " "), and only one child when parse_ws_pcdata is not set. This flag is off by default.
parse_ws_pcdata_single determines if whitespace-only PCDATA nodes that have no sibling nodes are to be put in DOM tree. In some cases application needs to parse the whitespace-only contents of nodes, i.e. <node> </node>, but is not interested in whitespace markup elsewhere. It is possible to use parse_ws_pcdata flag in this case, but it results in excessive allocations and complicates document processing; this flag can be used to avoid that. As an example, after parsing XML string <node> <a> </a> </node> with parse_ws_pcdata_single flag set, <node> element will have one child <a>, and <a> element will have one child with type node_pcdata and value " ". This flag has no effect if parse_ws_pcdata is enabled. This flag is off by default.
parse_embed_pcdata determines if PCDATA contents is to be saved as element values. Normally element nodes have names but not values; this flag forces the parser to store the contents as a value if PCDATA is the first child of the element node (otherwise PCDATA node is created as usual). This can significantly reduce the memory required for documents with many PCDATA nodes. To retrieve the data you can use xml_node::value() on the element nodes or any of the higher-level functions like child_value or text. Since this flag significantly changes the DOM structure it is only recommended for parsing documents with many PCDATA nodes in memory-constrained environments. This flag is off by default.
parse_merge_pcdata determines if PCDATA contents is to be merged with the previous PCDATA node when no intermediary nodes are present between them. If the PCDATA contains CDATA sections, PI nodes, or comments in between, and either of the flags parse_cdata, parse_pi, parse_comments is not set, the contents of the PCDATA node will be merged with the previous one. This flag is off by default. Note that this flag is not compatible with parse_embed_pcdata.
parse_fragment determines if document should be treated as a fragment of a valid XML. Parsing document as a fragment leads to top-level PCDATA content (i.e. text that is not located inside a node) to be added to a tree, and additionally treats documents without element nodes as valid and permits multiple top-level element nodes (currently multiple top-level element nodes are also permitted when the flag is off, but that behavior should not be relied on). This flag is off by default.
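The interaction between the whitespace-related flags can be easier to follow as a small decision function. The sketch below is not pugixml code: WhitespacePolicy, keep_whitespace_node and trim are hypothetical names that only model the behavior described for parse_ws_pcdata, parse_ws_pcdata_single and parse_trim_pcdata.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for three flags; these are plain booleans,
// not pugixml's real option constants.
struct WhitespacePolicy {
    bool ws_pcdata = false;         // models parse_ws_pcdata
    bool ws_pcdata_single = false;  // models parse_ws_pcdata_single
    bool trim_pcdata = false;       // models parse_trim_pcdata
};

// XML whitespace characters: space, tab, carriage return, line feed.
inline bool is_xml_space(char c) {
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// True if every character of the text is XML whitespace.
inline bool is_whitespace_only(const std::string& text) {
    for (char c : text)
        if (!is_xml_space(c)) return false;
    return true;
}

// Decide whether a whitespace-only PCDATA node would be kept in the tree.
inline bool keep_whitespace_node(const WhitespacePolicy& p, bool has_siblings) {
    if (p.ws_pcdata) return true;                // every whitespace-only node is kept
    return p.ws_pcdata_single && !has_siblings;  // only sibling-less nodes are kept
}

// Strip leading and trailing whitespace, as described for parse_trim_pcdata.
inline std::string trim(const std::string& text) {
    std::size_t first = 0, last = text.size();
    while (first < last && is_xml_space(text[first])) ++first;
    while (last > first && is_xml_space(text[last - 1])) --last;
    return text.substr(first, last - first);
}

// Usage example for the document <node> <a> </a> </node>.
inline void example_whitespace_policy() {
    WhitespacePolicy p;
    p.ws_pcdata_single = true;
    assert(!keep_whitespace_node(p, true));   // " " next to <a> is dropped
    assert(keep_whitespace_node(p, false));   // " " inside <a> is kept
}
```

With only ws_pcdata_single set, the whitespace around <a> is discarded because it has a sibling, while the whitespace inside <a> survives, which matches the example given for parse_ws_pcdata_single.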
These flags control the transformation of tree element contents:
parse_escapes determines if character and entity references are to be expanded during the parsing process. Character references have the form &#…; or &#x…; (… is Unicode numeric representation of character in either decimal (&#…;) or hexadecimal (&#x…;) form), entity references are &lt;, &gt;, &amp;, &apos; and &quot; (note that as pugixml does not handle DTD, the only allowed entities are predefined ones). If character/entity reference cannot be expanded, it is left as is, so you can do additional processing later. Reference expansion is performed on attribute values and PCDATA content. This flag is on by default.
parse_eol determines if EOL handling (that is, replacing sequences \r\n by a single \n character, and replacing all standalone \r characters by \n) is to be performed on input data (that is, comment contents, PCDATA/CDATA contents and attribute values). This flag is on by default.
parse_wconv_attribute determines if attribute value normalization should be performed for all attributes. This means that whitespace characters (new line, tab and space) are replaced with space (' '). New line characters are always treated as if parse_eol is set, i.e. \r\n is converted to a single space. This flag is on by default.
parse_wnorm_attribute determines if extended attribute value normalization should be performed for all attributes. This means that after attribute values are normalized as if parse_wconv_attribute was set, leading and trailing space characters are removed, and all sequences of space characters are replaced by a single space character. parse_wconv_attribute has no effect if this flag is on. This flag is off by default.
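The three text transformations described for parse_eol, parse_wconv_attribute and parse_wnorm_attribute build on each other, which a short sketch can make concrete. The functions below are hypothetical helpers written for illustration from the descriptions above; they are not pugixml's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Models the EOL handling described for parse_eol:
// "\r\n" becomes "\n", and any standalone "\r" becomes "\n".
inline std::string normalize_eol(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\r') {
            out.push_back('\n');
            // Swallow the "\n" that completes a "\r\n" pair.
            if (i + 1 < in.size() && in[i + 1] == '\n') ++i;
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

// Models parse_wconv_attribute: EOL handling first, then newline and tab
// characters are replaced with a space.
inline std::string wconv_attribute(const std::string& in) {
    std::string out = normalize_eol(in);
    for (char& c : out)
        if (c == '\n' || c == '\t') c = ' ';
    return out;
}

// Models parse_wnorm_attribute: the wconv step, then leading and trailing
// spaces are removed and runs of spaces collapse to a single space.
inline std::string wnorm_attribute(const std::string& in) {
    const std::string conv = wconv_attribute(in);
    std::string out;
    for (char c : conv) {
        if (c == ' ') {
            // Skip leading spaces and repeated spaces.
            if (!out.empty() && out.back() != ' ') out.push_back(' ');
        } else {
            out.push_back(c);
        }
    }
    if (!out.empty() && out.back() == ' ') out.pop_back();  // drop trailing space
    return out;
}

// Usage example: "\r\n" inside an attribute value ends up as one space.
inline void example_attribute_normalization() {
    assert(wconv_attribute("a\r\nb") == "a b");
    assert(wnorm_attribute(" a \t\r\n b ") == "a b");
}
```

Because wnorm_attribute starts from the wconv result, enabling both behaves the same as enabling the extended normalization alone, in line with the note that parse_wconv_attribute has no effect when parse_wnorm_attribute is on.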
Additionally there are three predefined option masks:
parse_minimal has all options turned off. This option mask means that pugixml does not add declaration nodes, document type declaration nodes, PI nodes, CDATA sections and comments to the resulting tree and does not perform any conversion for input data, so theoretically it is the fastest mode. However, as mentioned above, in practice parse_default is usually equally fast.
parse_default is the default set of flags, i.e. it has all options set to their default values. It includes parsing CDATA sections (comments/PIs are not parsed), performing character and entity reference expansion, replacing whitespace characters with spaces in attribute values and performing EOL handling. Note that PCDATA sections consisting only of whitespace characters are not parsed (by default) for performance reasons.
parse_full is the set of flags which adds nodes of all types to the resulting tree and performs default conversions for input data. It includes parsing CDATA sections, comments, PI nodes, document declaration node and document type declaration node, performing character and entity reference expansion, replacing whitespace characters with spaces in attribute values and performing EOL handling. Note that PCDATA sections consisting only of whitespace characters are not parsed in this mode.
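Since the options are bit flags, the predefined masks are unions of individual flags, and a custom mask can be derived with | and & ~. The sketch below illustrates that composition in a separate sketch namespace; the bit values are assumptions chosen for illustration and may differ from pugixml's actual constants, and has_flag is a hypothetical helper.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical bit values, one distinct bit per option.
constexpr unsigned parse_minimal         = 0x0000;  // no options set
constexpr unsigned parse_pi              = 0x0001;
constexpr unsigned parse_comments        = 0x0002;
constexpr unsigned parse_cdata           = 0x0004;
constexpr unsigned parse_escapes         = 0x0008;
constexpr unsigned parse_eol             = 0x0010;
constexpr unsigned parse_wconv_attribute = 0x0020;
constexpr unsigned parse_declaration     = 0x0040;
constexpr unsigned parse_doctype         = 0x0080;

// Default mask as described in the text: CDATA nodes, reference expansion,
// attribute whitespace conversion and EOL handling.
constexpr unsigned parse_default =
    parse_cdata | parse_escapes | parse_wconv_attribute | parse_eol;

// Full mask as described in the text: the default plus every node type.
constexpr unsigned parse_full =
    parse_default | parse_pi | parse_comments | parse_declaration | parse_doctype;

// True if the given single-bit flag is set in the option mask.
constexpr bool has_flag(unsigned options, unsigned flag) {
    return (options & flag) != 0;
}

// Usage example: add comments to the default mask, remove escape expansion.
inline void example_custom_mask() {
    constexpr unsigned custom = (parse_default | parse_comments) & ~parse_escapes;
    assert(has_flag(custom, parse_comments));
    assert(!has_flag(custom, parse_escapes));
    assert(has_flag(custom, parse_cdata));
}

}  // namespace sketch
```

Clearing a bit with & ~flag leaves every other option untouched, which is why a mask such as (parse_default | parse_comments) & ~parse_escapes keeps CDATA parsing and EOL handling enabled.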
This is an example of using different parsing options (samples/load_options.cpp):
const char* source = "<!--comment--><node>&lt;</node>";
pugi::xml_document doc;

// Parsing with default options; note that comment node is not added to the tree, and entity reference &lt; is expanded
doc.load_string(source);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

// Parsing with additional parse_comments option; comment node is now added to the tree
doc.load_string(source, pugi::parse_default | pugi::parse_comments);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

// Parsing with additional parse_comments option and without the (default) parse_escapes option; &lt; is not expanded
doc.load_string(source, (pugi::parse_default | pugi::parse_comments) & ~pugi::parse_escapes);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";

// Parsing with minimal option mask; comment node is not added to the tree, and &lt; is not expanded
doc.load_string(source, pugi::parse_minimal);
std::cout << "First node value: [" << doc.first_child().value() << "], node child value: [" << doc.child_value("node") << "]\n";