Regular Expressions in the STL
Regular expressions, defined in the <regex> header, are a powerful feature of the Standard Library. They are a special mini-language for string processing. They might seem complicated at first, but once you get to know them, they make working with strings easier.
Regular Expression Functions Provided
Some of the functions and iterators provided are:
- regex_match(): Match a regular expression against an entire string (of known size).
- regex_search(): Search for a string that matches a regular expression in an (arbitrarily long) stream of data. The result of a regex_search() is a collection of matches, typically represented as an smatch, which is a container of regex results.
- regex_replace(): Search for strings that match a regular expression in an (arbitrarily long) stream of data and replace them.
- regex_iterator: Iterate over matches and submatches.
- regex_token_iterator: Iterate over selected submatches, or over the non-matching text between matches.
The following program illustrates some simple operations:

    #include <regex>
    #include <iostream>

    int main() {
        std::regex r("st|mt|tr");
        // regex_match returns a bool, printed here as 1 or 0
        std::cerr << "st|mt|tr" << " matches st? " << std::regex_match("st", r) << std::endl;
        std::cerr << "st|mt|tr" << " matches mt? " << std::regex_match("mt", r) << std::endl;
        std::cerr << "st|mt|tr" << " matches spruce? " << std::regex_match("spruce", r) << std::endl;
        return 0;
    }
Different Regular Expression Grammars
There are several different grammars for regular expressions. For this reason, C++ supports several of them: ECMAScript, basic, extended, awk, grep, and egrep. If you already know one of these grammars, you can use it straight away in C++ by telling the regular expression library to use that specific syntax (std::regex_constants::syntax_option_type). The default grammar in C++ is ECMAScript, whose syntax is explained in the following section. It is also the most powerful grammar, so it's recommended to use ECMAScript instead of one of the other, more limited grammars. Explaining the other regular expression grammars falls outside the scope of this section.
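As a minimal sketch of selecting a grammar, the hypothetical helper matchesWith (not part of the library) passes a grammar flag to the std::regex constructor. It assumes the usual POSIX rule that + is an ordinary character in the basic grammar but a repetition operator in the extended and ECMAScript grammars.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: does `text` match `pattern` in full
// when the pattern is interpreted with the given grammar?
bool matchesWith(const std::string& text, const std::string& pattern,
                 std::regex::flag_type grammar) {
    std::regex rgx(pattern, grammar);  // the second argument selects the grammar
    return std::regex_match(text, rgx);
}
```

Called with "aaa" and the pattern "a+", the helper should report a match for std::regex::ECMAScript and std::regex::extended, but not for std::regex::basic, where the same pattern only matches the literal text "a+".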
ECMAScript Syntax
In C++, the ECMAScript regular expression grammar is the ECMA-262 grammar with modifications, marked with (C++ only) below.
The modified regular expression grammar is mostly the ECMAScript RegExp grammar with a POSIX-type expansion on locales under ClassAtom. Some clarifications on equality checks and number parsing are made.
The normative references in the standard specify ECMAScript 3.
See the MDN Guide on JavaScript RegExp for an overview on the dialect features.
ECMAScript syntax recognizes the following special characters:
^ $ \ . * + ? ( ) [ ] { } |
If you need to match one of these special characters, you need to escape it using the \ character. For example:
\[ or \. or \* or \\
C++-Only Differences
basic Syntax
extended Syntax
awk Syntax
grep Syntax
egrep Syntax
Using Raw Strings
Use raw string literals in regular expressions.
The regular expression for the text C++ is fairly unwieldy: C\\+\\+. You have to use two backslashes for each + sign. First, the + sign is a special character in a regular expression. Second, the backslash is a special character in a string literal. Therefore one backslash escapes the + sign; the other backslash escapes the backslash. By using a raw string literal, the second backslash is no longer necessary because the backslash is not interpreted in the string.

    #include <regex>
    //...
    std::string regExpr("C\\+\\+");        // ordinary literal: every backslash is doubled
    std::string regExprRaw(R"(C\+\+)");    // raw literal: the same pattern, written as is
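To check that both spellings really denote the same pattern, here is a small sketch. The names escapedPattern, rawPattern, and mentionsCpp are made up for this illustration.

```cpp
#include <regex>
#include <string>

// Both literals spell the same five-character pattern: C\+\+
const std::string escapedPattern = "C\\+\\+";
const std::string rawPattern = R"(C\+\+)";

// Hypothetical helper: true if `text` contains the literal text "C++".
bool mentionsCpp(const std::string& text) {
    const std::regex rgx(rawPattern);
    return std::regex_search(text, rgx);
}
```

Comparing escapedPattern and rawPattern with == yields true, so the choice between them is purely a matter of readability.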
Procedure for Applying Regular Expressions
- Define the regular expression object:

    std::string text = "C++ or c++.";
    std::string regExpr(R"(C\+\+)");
    std::regex rgx(regExpr);

- Store the result of the search:

    std::smatch result;
    std::regex_search(text, result, rgx);

- Process the result:

    std::cout << result[0] << '\n';
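The three steps can be combined into one function. The following is a sketch only; firstMatch is a hypothetical helper, and returning an empty string when nothing matches is a choice made for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first match of `pattern` in `text`,
// or an empty string if nothing matches.
std::string firstMatch(const std::string& text, const std::string& pattern) {
    std::regex rgx(pattern);                      // 1. define the regular expression
    std::smatch result;
    if (!std::regex_search(text, result, rgx)) {  // 2. store the result of the search
        return "";
    }
    return result[0].str();                       // 3. process the result
}
```

For the text "C++ or c++." and the pattern R"(C\+\+)", the helper returns "C++". Checking the return value of std::regex_search before reading result[0] avoids processing an unsuccessful search.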
Text Types
The text type determines the character type of the regular expression and the type of the search result.
The table below shows the four different combinations.
Text type | Regular expression type | Result type |
---|---|---|
const char* | std::regex | std::cmatch |
std::string | std::regex | std::smatch |
const wchar_t* | std::wregex | std::wcmatch |
std::wstring | std::wregex | std::wsmatch |
Regular Expression Objects
Regular expression objects are instances of the class template basic_regex, declared as template <class charT, class traits = regex_traits<charT>> class basic_regex, and are parametrized by their character type and traits class. The traits class defines how the properties of the regular expression grammar are interpreted. There are two type synonyms in C++:

    typedef basic_regex<char> regex;
    typedef basic_regex<wchar_t> wregex;

You can further customize a regular expression object by specifying the grammar used or adapting the syntax. As mentioned, in addition to the default ECMAScript grammar, C++ supports the basic, extended, awk, grep, and egrep grammars.
A regular expression qualified by the std::regex_constants::icase flag is case insensitive. If you want to adapt the syntax, you have to specify the grammar explicitly.
    // regexGrammar.cpp
    ...
    #include <regex>
    ...
    using std::regex_constants::ECMAScript;
    using std::regex_constants::icase;

    std::string theQuestion = "C++ or c++, that's the question.";
    std::string regExprStr(R"(c\+\+)");

    // default: case-sensitive search
    std::regex rgx(regExprStr);
    std::smatch smatch;
    if (std::regex_search(theQuestion, smatch, rgx)) {
        std::cout << "case sensitive: " << smatch[0];
    }

    // grammar plus icase: case-insensitive search
    std::regex rgxIn(regExprStr, ECMAScript|icase);
    if (std::regex_search(theQuestion, smatch, rgxIn)) {
        std::cout << "case insensitive: " << smatch[0];
    }
If you use the case-sensitive regular expression rgx, the result of the search in the text theQuestion is c++. That's not the case if the case-insensitive regular expression rgxIn is applied. Now you get the match string C++.
The Search Result match_results
An object of type std::match_results is the result of a std::regex_match or std::regex_search.
After a successful match, std::match_results is a sequence container holding at least one std::sub_match object: element 0 is the total match, followed by one element per capture group. The std::sub_match objects are sequences of characters.
C++ has four type synonyms for std::match_results:

    typedef match_results<const char*> cmatch;
    typedef match_results<const wchar_t*> wcmatch;
    typedef match_results<string::const_iterator> smatch;
    typedef match_results<wstring::const_iterator> wsmatch;
The search result std::smatch has a powerful interface.
Member Function | Description |
---|---|
smatch.size() | Returns the number of capture groups, including the total match (group 0). |
smatch.empty() | Returns true if the search result has no capture group, i.e. the match failed. |
smatch[i] | Returns the ith capture group. |
smatch.length(i) | Returns the length of the ith capture group. |
smatch.position(i) | Returns the position of the ith capture group. |
smatch.str(i) | Returns the ith capture group as string. |
smatch.prefix() and smatch.suffix() | Returns the string before and after the total match. |
smatch.begin() and smatch.end() | Returns the begin and end iterator for the capture groups. |
smatch.format(...) | Formats std::smatch objects for the output. |
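A short sketch of this interface follows. MatchSummary and summarize are hypothetical names introduced only for this illustration; the member functions called on the std::smatch object are the ones listed in the table.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical summary type, used only for this illustration.
struct MatchSummary {
    std::size_t size = 0;         // number of sub-matches, including the total match
    std::ptrdiff_t position = 0;  // offset of the total match within the text
    std::string prefix;           // text before the total match
    std::string match;            // the total match (group 0)
    std::string suffix;           // text after the total match
};

MatchSummary summarize(const std::string& text, const std::string& pattern) {
    MatchSummary s;
    std::regex rgx(pattern);
    std::smatch m;
    if (std::regex_search(text, m, rgx)) {
        s.size = m.size();
        s.position = m.position(0);
        s.prefix = m.prefix().str();
        s.match = m.str(0);
        s.suffix = m.suffix().str();
    }
    return s;  // all fields keep their defaults if the search failed
}
```

For the text "xx-aaab-yy" and the pattern "(a+)(b+)", the summary holds size 3, position 3, prefix "xx-", match "aaab", and suffix "-yy".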
The following program shows the output of the first four capture groups for different regular expressions.
    // captureGroups.cpp
    ...
    #include <regex>
    ...
    using namespace std;

    void showCaptureGroups(const string& regEx, const string& text){
        regex rgx(regEx);
        smatch smatch;
        if (regex_search(text, smatch, rgx)) {
            // groups that do not exist in the pattern are printed as empty strings
            cout << regEx << " " << text << " " << smatch[0] << " " << smatch[1]
                 << " " << smatch[2] << " " << smatch[3] << endl;
        }
    }

    showCaptureGroups("abc+", "abccccc");
    showCaptureGroups("(a+)(b+)", "aaabccc");
    showCaptureGroups("((a+)(b+))", "aaabccc");
    showCaptureGroups("(ab)(abc)+", "ababcabc");
std::sub_match
The capture groups are of type std::sub_match. As with std::match_results, C++ defines the following four type synonyms.

    typedef sub_match<const char*> csub_match;
    typedef sub_match<const wchar_t*> wcsub_match;
    typedef sub_match<string::const_iterator> ssub_match;
    typedef sub_match<wstring::const_iterator> wssub_match;
You can further analyze the capture group cap.
Member | Description |
---|---|
cap.matched | Indicates if this match was successful. |
cap.first and cap.second | The begin and end iterator of the character sequence. |
cap.length() | Returns the length of the capture group. |
cap.str() | Returns the capture group as a string. |
cap.compare(other) | Compares the current capture group with the other capture group. |
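The members above matter most when a capture group may not take part in the match. The following sketch uses a hypothetical helper classify with an alternation of two groups, so that exactly one of them is matched on success.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reports which alternative of (\d+)|([a-z]+) matched.
std::string classify(const std::string& text) {
    static const std::regex rgx(R"((\d+)|([a-z]+))");
    std::smatch m;
    if (!std::regex_match(text, m, rgx)) {
        return "none";
    }
    const std::ssub_match& number = m[1];
    const std::ssub_match& word = m[2];
    if (number.matched) {  // group 1 took part in the match
        return "number of length " + std::to_string(number.length());
    }
    if (word.compare("cpp") == 0) {  // compare() returns 0 on equality
        return "the word cpp";
    }
    return "word: " + word.str();
}
```

For "2011" the helper returns "number of length 4", for "cpp" it returns "the word cpp", and for "regex" it returns "word: regex". A group that did not participate has matched set to false and an empty character sequence.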
Here is a code snippet showing the interplay between the search result std::match_results and its capture groups of type std::sub_match:

    // subMatch.cpp
    ...
    #include <algorithm>
    #include <regex>
    ...
    using std::cout;

    std::string privateAddress = "192.168.178.21";
    std::string regEx(R"((\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}))");
    std::regex rgx(regEx);
    std::smatch smatch;

    if (std::regex_match(privateAddress, smatch, rgx)) {
        for (auto cap: smatch) {
            cout << "capture group: " << cap << '\n';
            if (cap.matched) {
                // print the character codes of the capture group in hexadecimal
                std::for_each(cap.first, cap.second, [](int v) {
                    cout << std::hex << v << " ";
                });
                cout << '\n';
            }
        }   // for
    }
    ...

Output:

    capture group: 192.168.178.21
    31 39 32 2e 31 36 38 2e 31 37 38 2e 32 31
    capture group: 192
    31 39 32
    capture group: 168
    31 36 38
    capture group: 178
    31 37 38
    capture group: 21
    32 31

The regular expression regEx stands for an IPv4 address. regEx extracts the address's components using capture groups. Finally, each capture group is displayed, followed by the ASCII codes of its characters as hexadecimal values.
Matching
std::regex_match determines whether the entire text matches a text pattern. If you pass a std::match_results object, it is filled in so that you can analyze the result further. The same result type is also used by std::regex_search.
An Example
The code snippet below shows three simple applications of std::regex_match: with a C string, a C++ string, and a range, each returning only a boolean. The same three variants are also available with a std::match_results object.

    // match.cpp
    ...
    #include <regex>
    ...
    std::string numberRegEx(R"([-+]?([0-9]*\.[0-9]+|[0-9]+))");
    std::regex rgx(numberRegEx);

    // C string
    const char* numChar{"2011"};
    if (std::regex_match(numChar, rgx)) {
        std::cout << numChar << " is a number." << '\n';
    }   // 2011 is a number.

    // C++ string
    const std::string numStr{"3.14159265359"};
    if (std::regex_match(numStr, rgx)) {
        std::cout << numStr << " is a number." << '\n';
    }   // 3.14159265359 is a number.

    // range
    const std::vector<char> numVec{'-', '2', '.', '7', '1', '8', '2', '8', '1', '8', '2', '8'};
    if (std::regex_match(numVec.begin(), numVec.end(), rgx)) {
        for (auto c: numVec) { std::cout << c; }
        std::cout << " is a number." << '\n';
    }   // -2.718281828 is a number.
std::regex_match (C++11)
Its overloads take as their first argument(s):
- a begin and an end iterator, or
- a pointer to const CHAR (const CHAR*), or
- a constant reference to a string (const std::basic_string<CHAR>&)
These one or two parameters may be followed by a non-constant reference to a std::match_results object.
The next parameter is mandatory: a constant reference to a std::basic_regex object.
Last is an optional flags parameter.
std::regex_constants::match_flag_type
Flags
Their type is implementation-defined.
Name | Explanation |
---|---|
match_not_bol | The first character in [first, last) will be treated as if it is not at the beginning of a line (i.e. ^ will not match [first, first)). |
match_not_eol | The last character in [first, last) will be treated as if it is not at the end of a line (i.e. $ will not match [last, last)). |
match_not_bow | \b will not match [first, first). |
match_not_eow | \b will not match [last, last). |
match_any | If more than one match is possible, then any match is an acceptable result. |
match_not_null | Do not match empty sequences. |
match_continuous | Only match a sub-sequence that begins at first. |
match_prev_avail | --first is a valid iterator position. When set, causes match_not_bol and match_not_bow to be ignored. |
format_default | Use ECMAScript rules to construct strings in std::regex_replace. |
format_sed | Use POSIX sed utility rules in std::regex_replace. |
format_no_copy | Do not copy un-matched strings to the output in std::regex_replace. |
format_first_only | Only replace the first match in std::regex_replace. |
All constants, except for match_default and format_default, are bitmask elements. The match_default and format_default constants are empty bitmasks.
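The effect of two of the match flags can be sketched as follows. The helper found and the namespace alias rc are hypothetical names for this example; the flags themselves are the ones from the table.

```cpp
#include <regex>
#include <string>

namespace rc = std::regex_constants;

// Hypothetical helper: searches `text` for `pattern` with the given match flags.
bool found(const std::string& text, const std::string& pattern,
           rc::match_flag_type flags = rc::match_default) {
    return std::regex_search(text, std::regex(pattern), flags);
}
```

With the default flags, found("abc", "^a") is true, but with rc::match_not_bol it is false because ^ no longer matches at the first character. Likewise, found("xab", "ab") is true, while rc::match_continuous makes it false because the match would have to begin at the first character.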
Searching
std::regex_search checks if the text contains a text pattern. You can use the function with and without a std::match_results object and apply it to a C string, a C++ string, or a range.
An Example
The example below shows how to use std::regex_search with texts of type const char*, std::string, const wchar_t*, and std::wstring.

    // search.cpp
    ...
    #include <regex>
    ...
    // regular expression holder for time
    std::regex crgx("([01]?[0-9]|2[0-3]):[0-5][0-9]");

    // const char*
    std::cmatch cmatch;
    const char* ctime{"Now it is 23:10."};
    if (std::regex_search(ctime, cmatch, crgx)) {
        std::cout << ctime << '\n';
        std::cout << "Time: " << cmatch[0] << '\n';    // Time: 23:10
    }

    // std::string
    std::smatch smatch;
    std::string stime{"Now it is 23:25."};
    if (std::regex_search(stime, smatch, crgx)) {
        std::cout << stime << '\n';
        std::cout << "Time: " << smatch[0] << '\n';    // Time: 23:25
    }

    // regular expression holder for time (wide characters)
    std::wregex wrgx(L"([01]?[0-9]|2[0-3]):[0-5][0-9]");

    // const wchar_t*
    std::wcmatch wcmatch;
    const wchar_t* wctime{L"Now it is 23:47."};
    if (std::regex_search(wctime, wcmatch, wrgx)) {
        std::wcout << wctime << '\n';
        std::wcout << "Time: " << wcmatch[0] << '\n';  // Time: 23:47
    }

    // std::wstring
    std::wsmatch wsmatch;
    std::wstring wstime{L"Now it is 00:03."};
    if (std::regex_search(wstime, wsmatch, wrgx)) {
        std::wcout << wstime << '\n';
        std::wcout << "Time: " << wsmatch[0] << '\n';  // Time: 00:03
    }
std::regex_search (C++11)
Determines if there is a match between the regular expression e and some subsequence in the target character sequence. The detailed match result is stored in the input-output parameter std::match_results<> m (if present).
The return type is bool.
The main difference between std::regex_match and std::regex_search is...
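One observable difference between the two functions can be sketched with the time pattern from the example above: std::regex_match succeeds only if the pattern covers the entire text, whereas std::regex_search succeeds as soon as some subsequence matches. The helper names wholeTextIsTime and textContainsTime are hypothetical.

```cpp
#include <regex>
#include <string>

// The time pattern from search.cpp.
const char* const timePattern = "([01]?[0-9]|2[0-3]):[0-5][0-9]";

// Hypothetical helper: true only if the entire text is a time.
bool wholeTextIsTime(const std::string& text) {
    return std::regex_match(text, std::regex(timePattern));
}

// Hypothetical helper: true if a time occurs anywhere in the text.
bool textContainsTime(const std::string& text) {
    return std::regex_search(text, std::regex(timePattern));
}
```

For "23:10" both helpers return true. For "Now it is 23:10." only textContainsTime returns true, because the surrounding words are not covered by the pattern.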
Replacing
std::regex_replace replaces sequences in a text matching a text pattern. In its simple form, std::regex_replace(text, regex, replString), it returns its result as a string. The function replaces every occurrence of regex in text with replString.

    // replace.cpp
    ...
    #include <regex>
    ...
    using namespace std;

    string future{"Future"};
    string unofficialName{"The unofficial name of the new C++ standard is C++0x."};

    // replace C++0x with C++11
    regex rgxCpp{R"(C\+\+0x)"};
    string newCppName{"C++11"};
    string newName{regex_replace(unofficialName, rgxCpp, newCppName)};

    // replace unofficial with official
    regex rgxOff{"unofficial"};
    string makeOfficial{"official"};
    string officialName{regex_replace(newName, rgxOff, makeOfficial)};

    cout << officialName << endl;
    // The official name of the new C++ standard is C++11.
In addition to the simple version, C++ has a version of std::regex_replace working on ranges. It enables you to push the modified string directly into another string:

    // text, regex, and replString have the same meaning as in the simple form
    std::string str2;
    std::regex_replace(std::back_inserter(str2), text.begin(), text.end(), regex, replString);
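A self-contained sketch of the range version follows. The helper appendReplaced is a hypothetical name; it appends the modified text to a string that may already have content.

```cpp
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: appends `text` to `out`, with every match of `rgx`
// replaced by `replacement`.
void appendReplaced(std::string& out, const std::string& text,
                    const std::regex& rgx, const std::string& replacement) {
    std::regex_replace(std::back_inserter(out), text.begin(), text.end(),
                       rgx, replacement);
}
```

Starting from out == "Result: ", a call with the text "C++0x and C++0x", the pattern R"(C\+\+0x)", and the replacement "C++11" leaves "Result: C++11 and C++11" in out. Writing through an output iterator avoids creating an intermediate string.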
All variants of std::regex_replace have an additional optional flags parameter. If you set it to std::regex_constants::format_no_copy, the result contains only the replacements for the parts of the text matching the regular expression; the unmatched text is not copied. If you set it to std::regex_constants::format_first_only, std::regex_replace replaces only the first match.
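The two flags can be compared side by side in a small sketch. The helper replaceDigits and the alias rc are hypothetical names for this example; it replaces every run of digits with #.

```cpp
#include <regex>
#include <string>

namespace rc = std::regex_constants;

// Hypothetical helper: replaces runs of digits in `text` with "#",
// using the given format flags.
std::string replaceDigits(const std::string& text,
                          rc::match_flag_type flags = rc::match_default) {
    static const std::regex digits(R"(\d+)");
    return std::regex_replace(text, digits, "#", flags);
}
```

For the text "a1b22c333", the default call returns "a#b#c#", rc::format_first_only returns "a#b22c333", and rc::format_no_copy returns "###". Because the flags are bitmask elements, they can be combined with |; both together return "#".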
Formatting
std::regex_replace and std::match_results::format, in combination with capture groups, enable you to format text. You can use a format string together with placeholders to insert the values.
Here are both possibilities, first with std::regex_replace and then with std::match_results::format:

    // format.cpp
    ...
    #include <regex>
    ...
    std::string future{"Future"};
    const std::string unofficial{"unofficial,C++0x"};
    const std::string official{"official,C++11"};

    std::regex regValues{"(.*),(.*)"};
    std::string standardText{"The $1 name of the new C++ standard is $2."};

    // variant 1: search and format in one step
    std::string textNow = std::regex_replace(unofficial, regValues, standardText);
    std::cout << textNow << '\n';
    // The unofficial name of the new C++ standard is C++0x.

    // variant 2: search first, format later
    std::smatch smatch;
    if (std::regex_match(official, smatch, regValues)) {
        std::cout << smatch.str() << '\n';    // official,C++11
        std::string textFuture = smatch.format(standardText);
        std::cout << textFuture << '\n';
    }
    // The official name of the new C++ standard is C++11.
In the function call std::regex_replace(unofficial, regValues, standardText), the text matching the first and second capture group of the regular expression regValues is extracted from the string unofficial. The placeholders $1 and $2 in the text standardText are then replaced by the extracted values. The strategy of smatch.format(standardText) is similar, but there is a difference: the creation of the search result smatch is separated from its use when formatting the string.
In addition to capture groups, C++ supports additional format escape sequences. You can use them in format strings:
Format escape sequence | Description |
---|---|
$& | Returns the total match (0th capture group). |
$$ | Returns $. |
$` (backtick) | Returns the text before the total match. |
$' (apostrophe) | Returns the text after the total match. |
$i | Returns the ith capture group. |
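The escape sequences can be tried out with a small sketch built on std::match_results::format. The helper formatFirstMatch is a hypothetical name, and returning an empty string when nothing matches is a choice made for this example.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: formats the first match of `pattern` in `text`
// with the format string `fmt`; returns an empty string if nothing matches.
std::string formatFirstMatch(const std::string& text, const std::string& pattern,
                             const std::string& fmt) {
    std::regex rgx(pattern);
    std::smatch m;
    if (!std::regex_search(text, m, rgx)) {
        return "";
    }
    return m.format(fmt);  // expands $&, $$, $`, $' and $i
}
```

For the text "abc" and the pattern "b", the format string "[$`|$&|$'|$$]" expands to "[a|b|c|$]". For "key=value" and the pattern R"((\w+)=(\w+))", the format string "$2 <- $1" expands to "value <- key".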
Repeated Search
It's pretty convenient to iterate through the matched texts with std::regex_iterator and std::regex_token_iterator. std::regex_iterator gives you the matches and their capture groups. std::regex_token_iterator supports more: you can address individual capture groups of each match, and the index -1 gives you access to the text between the matches.
std::regex_iterator
C++ defines the following four type synonyms for std::regex_iterator:

    typedef regex_iterator<const char*> cregex_iterator;
    typedef regex_iterator<const wchar_t*> wcregex_iterator;
    typedef regex_iterator<std::string::const_iterator> sregex_iterator;
    typedef regex_iterator<std::wstring::const_iterator> wsregex_iterator;
You may use std::regex_iterator to count the occurrences of the words in a text:

    // regexIterator.cpp
    ...
    #include <regex>
    #include <unordered_map>
    ...
    using std::cout;

    std::string text{"That's a (to me) amazingly frequent question. It may be the most "
                     "frequently asked question. Surprisingly, C++11 feels like a new "
                     "language: The pieces just fit together better than they used to, "
                     "and I find a higher-level style of programming more natural than "
                     "before and as efficient as ever."};

    std::regex wordReg{R"(\w+)"};
    std::sregex_iterator wordItBegin(text.begin(), text.end(), wordReg);
    const std::sregex_iterator wordItEnd;

    // count each word
    std::unordered_map<std::string, std::size_t> allWords;
    for (; wordItBegin != wordItEnd; ++wordItBegin) {
        ++allWords[wordItBegin->str()];
    }

    for (auto wordIt: allWords) cout << "(" << wordIt.first << ":" << wordIt.second << ")";
    // excerpt of the output (the order is unspecified):
    // (as:2)(of:1)(level:1)(find:1)(ever:1)(and:2)(natural:1)
A word consists of at least one word character (\w+). This regular expression is used to define the begin iterator wordItBegin; the end iterator wordItEnd is default-constructed.
The iteration through the matches happens in the for loop. Each word increments its counter: ++allWords[wordItBegin->str()]. If the word is not yet in allWords, an entry with the counter 0 is created first, so the word ends up with the counter 1.
std::regex_token_iterator
C++ defines the following four type synonyms for std::regex_token_iterator:

    typedef regex_token_iterator<const char*> cregex_token_iterator;
    typedef regex_token_iterator<const wchar_t*> wcregex_token_iterator;
    typedef regex_token_iterator<std::string::const_iterator> sregex_token_iterator;
    typedef regex_token_iterator<std::wstring::const_iterator> wsregex_token_iterator;
std::regex_token_iterator enables you to use indexes to explicitly specify which capture groups you are interested in. If you don't specify an index, you get the total match (index 0) of each match; you can also request specific capture groups using their respective index.
The -1 index is special: You can use -1 to address the text between the matches.

    // tokenIterator.cpp
    ...
    using namespace std;

    std::string text{"Pete Becker,The C++ Standard Library Extensions,2006:"
                     "Nicolai Josuttis,The C++ Standard Library,1999:"
                     "Andrei Alexandrescu,Modern C++ Design,2001"};

    regex regBook(R"((\w+)\s(\w+),([\w\s\+]*),(\d{4}))");

    // no index: iterate over the total matches
    sregex_token_iterator bookItBegin(text.begin(), text.end(), regBook);
    const sregex_token_iterator bookItEnd;
    while (bookItBegin != bookItEnd){
        cout << *bookItBegin++ << endl;
    }
    // Pete Becker,The C++ Standard Library Extensions,2006
    // Nicolai Josuttis,The C++ Standard Library,1999
    // Andrei Alexandrescu,Modern C++ Design,2001

    // indexes 2 and 4: last name and year of issue of each match
    sregex_token_iterator bookItNameIssueBegin(text.begin(), text.end(), regBook, {2, 4});
    const sregex_token_iterator bookItNameIssueEnd;
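The effect of the index, including the special value -1, can be sketched with a hypothetical helper tokens that collects everything the iterator yields into a vector.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the tokens selected by `submatch`.
// 0 selects the total match, i > 0 the ith capture group,
// and -1 the text between the matches.
std::vector<std::string> tokens(const std::string& text, const std::regex& rgx,
                                int submatch) {
    std::vector<std::string> out;
    std::sregex_token_iterator it(text.begin(), text.end(), rgx, submatch);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

With a separator pattern such as R"(\s*,\s*)" and the text "a , b,c", the index -1 yields the fields a, b, and c, which makes std::regex_token_iterator usable for splitting strings. With the pattern R"((\w+)=(\w+))" and the text "x=1 y=2", the index 2 yields 1 and 2.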
Removing Greediness
To make regular expression repetitions non-greedy, a ? can be added behind the repeat as in *? , +? , ?? , and {...}? . A non-greedy repetition repeats its pattern as few times as possible while still matching the remainder of the regular expression.
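A minimal sketch of the difference, using a hypothetical helper firstTag that searches for text in angle brackets with either the greedy pattern <.+> or the non-greedy pattern <.+?>:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first match of <.+> (greedy)
// or <.+?> (non-greedy) in `text`, or an empty string if there is none.
std::string firstTag(const std::string& text, bool greedy) {
    const std::regex rgx(greedy ? "<.+>" : "<.+?>");
    std::smatch m;
    return std::regex_search(text, m, rgx) ? m.str() : std::string();
}
```

For the text "<a><b>", the greedy pattern matches the whole text "<a><b>", while the non-greedy pattern stops at the first closing bracket and matches only "<a>".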