How to write a Parser in C++
We can hope to parse some types of structured text with C++.
Parsers are classified into:
- event-driven: they just produce tokens as they proceed along an input stream; and
- DOM: they build an internal representation (Document Object Model) of the structured file they read; the term DOM itself is usually applied in the context of HTML.
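The difference between the two styles can be sketched on a deliberately trivial input, a stream of whitespace-separated words. The names parse_events and parse_document are hypothetical, chosen only for this illustration; a real DOM would be a tree rather than a flat list.

```cpp
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Event-driven style: report each word to a callback as soon as it is read.
// The parser itself stores nothing.
void parse_events(std::istream& in,
                  const std::function<void(const std::string&)>& on_word) {
    std::string word;
    while (in >> word) {
        on_word(word);
    }
}

// DOM style: consume the whole input and return an in-memory representation
// (here simply a list of words) that the caller can inspect afterwards.
std::vector<std::string> parse_document(std::istream& in) {
    std::vector<std::string> words;
    parse_events(in, [&words](const std::string& w) { words.push_back(w); });
    return words;
}
```

Note that the DOM-style function is built on top of the event-driven one: collecting the events into a structure is one common way to obtain a document model.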
We can further classify parsers by the kind of file (or, generally, input stream) that they work on:
- text: understood to be made up of words and punctuation; perhaps a parser might recognize period-delimited sentences, paragraphs, and little more
- computer language: much more structured than text; these subdivide into non-code (such as XML and JSON) and code (programming languages), the latter being orders of magnitude more difficult to parse.
- code:
In order to parse a form of structured text, it is nearly always convenient to first tokenize it, that is, to divide it into chunks such as string, punctuation, reserved word, opening or closing delimiter, etc. (Divide and Conquer.)
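A minimal sketch of such a tokenizer follows. The TokenKind categories, the Token struct and the tokenize function are assumptions made for this example; it does not recognize reserved words or escape sequences inside strings.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token categories for this illustration.
enum class TokenKind { Word, String, Punctuation, Open, Close };

struct Token {
    TokenKind kind;
    std::string text;
};

std::vector<Token> tokenize(const std::string& input) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        const unsigned char c = static_cast<unsigned char>(input[i]);
        if (std::isspace(c)) {
            ++i;  // whitespace only separates tokens
        } else if (std::isalnum(c)) {
            const std::size_t start = i;
            while (i < input.size() &&
                   std::isalnum(static_cast<unsigned char>(input[i]))) {
                ++i;
            }
            tokens.push_back({TokenKind::Word, input.substr(start, i - start)});
        } else if (c == '"') {
            const std::size_t start = ++i;  // first character after the quote
            while (i < input.size() && input[i] != '"') {
                ++i;
            }
            tokens.push_back({TokenKind::String, input.substr(start, i - start)});
            if (i < input.size()) {
                ++i;  // skip the closing quote
            }
        } else if (c == '(' || c == '[' || c == '{') {
            tokens.push_back({TokenKind::Open, std::string(1, input[i])});
            ++i;
        } else if (c == ')' || c == ']' || c == '}') {
            tokens.push_back({TokenKind::Close, std::string(1, input[i])});
            ++i;
        } else {
            tokens.push_back({TokenKind::Punctuation, std::string(1, input[i])});
            ++i;
        }
    }
    return tokens;
}
```

A later parsing stage can then work on the vector of tokens instead of on raw characters, which is where the divide-and-conquer benefit comes from.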
Concepts
- delimiters: enclose text or just runs of characters. Some common delimiters are:
  - quotes (" or ') for text, as in: He specifically said 'care' the last time we spoke.
  - round brackets: sometimes used to enforce precedence: (a + b) * c
  - angle brackets: common in XML and HTML
- state: such as the reading cursor's being inside a string or inside a CDATA section (XML-specific). Besides, a parser may keep track of matching delimiters. Often enough, state can be kept in a stack structure.
- tokens
- structures to be filled in
- recursive structures and calls
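As an illustration of keeping state in a stack, the sketch below checks that opening and closing delimiters match. The function name delimiters_balanced is hypothetical, and the check is deliberately naive: it ignores quoted strings and treats every < and > as a bracket, which a real parser could not do.

```cpp
#include <stack>
#include <string>

// Returns true if every closing delimiter matches the most recent
// unmatched opening delimiter, and nothing is left open at the end.
bool delimiters_balanced(const std::string& text) {
    std::stack<char> open;  // the parser state: delimiters still waiting to be closed
    for (char c : text) {
        switch (c) {
            case '(': case '[': case '{': case '<':
                open.push(c);
                break;
            case ')': case ']': case '}': case '>': {
                if (open.empty()) {
                    return false;  // closing delimiter with nothing open
                }
                const char expected = (c == ')') ? '('
                                    : (c == ']') ? '['
                                    : (c == '}') ? '{'
                                                 : '<';
                if (open.top() != expected) {
                    return false;  // mismatched pair, e.g. "(]"
                }
                open.pop();
                break;
            }
            default:
                break;  // ordinary character, no state change
        }
    }
    return open.empty();
}
```

With this sketch, (a + b) * c is reported as balanced, while (a + [b)] is not.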
Common Actions/Operations
- Skipping whitespace
- Initializing a string stream
  You can initialize a std::istringstream from a string str like so: std::istringstream iss(str);
- Recursive invocations
- switch
- Variants
- Result structures
- Reading a single character then prepending it to a whole run
  iss >> c; iss >> str; str = c + str;
...
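Several of these operations can be combined in one small sketch: initialize a string stream, read a single character, dispatch on it with switch, and otherwise prepend it to the rest of the run. The function classify_first_run and its return strings are hypothetical. One caveat worth noting: because iss >> str skips leading whitespace, the plain idiom would glue a one-character run to the next word, so this version first uses peek() to check whether the run actually continues.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// Looks at the first non-whitespace run in str and describes it.
std::string classify_first_run(const std::string& str) {
    std::istringstream iss(str);
    char c = '\0';
    if (!(iss >> c)) {
        return "empty";  // nothing but whitespace, or no input at all
    }
    switch (c) {
        case '(': return "open";
        case ')': return "close";
        case '"': return "quote";
        default:  break;  // ordinary character: part of a word
    }
    std::string rest;
    const int next = iss.peek();
    // Read the remainder only if the run continues right after c.
    if (next != std::istringstream::traits_type::eof() && !std::isspace(next)) {
        iss >> rest;
    }
    return "word:" + (c + rest);  // prepend the single character to the run
}
```

For the input a bc this returns word:a, whereas reading the rest unconditionally would have produced abc.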
C++ Elements
- Streams: cin and cout (in <iostream>), file (in <fstream>) and string streams (in <sstream>)
- std::getline(ISTREAM& is, STRING& str, CHAR delimiter)
- std::basic_istream::peek()
- strings: std::string and string_view
- manipulators, specifically std::ws
- get() and get(char& c): extracts a single character from the stream. The character is either returned (first signature), or set as the value of its argument (second signature).
- switch keyword
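Two of these elements are sketched below: std::getline with a custom delimiter to split a line into fields, and peek() together with get() to read a run of digits without consuming the character that follows it. The helper names split_fields and read_digits are hypothetical.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Splits line at every occurrence of delimiter using std::getline.
std::vector<std::string> split_fields(const std::string& line, char delimiter) {
    std::vector<std::string> fields;
    std::istringstream iss(line);
    std::string field;
    while (std::getline(iss, field, delimiter)) {
        fields.push_back(field);
    }
    return fields;
}

// Reads consecutive digits from the stream. peek() inspects the next
// character without extracting it, so the first non-digit stays in the stream.
std::string read_digits(std::istream& in) {
    std::string digits;
    while (in.peek() != std::istream::traits_type::eof() &&
           std::isdigit(in.peek())) {
        digits += static_cast<char>(in.get());
    }
    return digits;
}
```

Empty fields between two adjacent delimiters are preserved by split_fields, while a single trailing delimiter does not produce an extra empty field.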
Skipping White Space
Manipulator std::ws extracts as many whitespace characters as possible from the current position in the input sequence. The extraction stops as soon as a non-whitespace character is found. These extracted whitespace characters are discarded.
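A typical use of std::ws is in front of std::getline, which is an unformatted operation and therefore does not skip leading whitespace on its own. The sketch below assumes a hypothetical helper named read_trimmed_line.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Discards any leading whitespace (including newlines left over from a
// previous extraction), then reads up to the end of the line.
std::string read_trimmed_line(std::istream& in) {
    std::string line;
    std::getline(in >> std::ws, line);
    return line;
}
```

This is useful after reading a number with operator>>, which leaves the newline that followed the number in the stream; without std::ws, the next std::getline would return an empty line.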
Note: basic_istream objects have the skipws flag set by default, which has a similar effect before each formatted extraction operation.
Alternatively,
ios_base& skipws (ios_base& str);
sets the skipws format flag for the str stream.
When the skipws format flag is set, as many whitespace characters as necessary are read and discarded from the stream until a non-whitespace character is found. This happens before every formatted input operation performed with operator>> on the stream.
Tabs, carriage returns and blank spaces are all considered whitespace.
This flag can be unset with the noskipws manipulator, forcing extraction operations to consider leading whitespace as part of the content to be extracted.
For standard streams, the skipws flag is set on initialization.
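The effect of the flag can be observed by extracting characters one at a time with operator>>. In the sketch below, extract_chars is a hypothetical helper that reads a string with the flag either left at its default or cleared with std::noskipws.

```cpp
#include <ios>
#include <sstream>
#include <string>

// Extracts input character by character with operator>> and returns
// everything that was extracted.
std::string extract_chars(const std::string& input, bool skip_whitespace) {
    std::istringstream iss(input);
    if (!skip_whitespace) {
        iss >> std::noskipws;  // whitespace now counts as content
    }
    std::string out;
    char c = '\0';
    while (iss >> c) {
        out += c;
    }
    return out;
}
```

With the default flag the whitespace disappears from the result; with noskipws the input comes back unchanged.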