Regular Expressions in the STL

Regular expressions, defined in the <regex> header, are a powerful feature of the Standard Library. They are a special mini-language for string processing. They might seem complicated at first, but once you get to know them, they make working with strings easier.

Regular Expression Functions Provided

Some of the functions and iterators provided are:

regex_match(): Match a regular expression against an entire string (of known size).
regex_search(): Search for a substring that matches a regular expression in an (arbitrarily long) stream of data. The details of a successful regex_search() are stored in a match_results object, typically an smatch, which is a container of sub-matches.
regex_replace(): Search for strings that match a regular expression in an (arbitrarily long) stream of data and replace them.
regex_iterator: iterate over all matches and their submatches.
regex_token_iterator: iterate over selected submatches, or over the non-matching text between matches.

The following program illustrates some simple operations:

#include <regex>
#include <iostream>

int main (int argc, const char * argv[]) {
    std::regex r("st|mt|tr");
    std::cerr << "st|mt|tr" << " matches st? " << std::regex_match("st", r) << std::endl;
    std::cerr << "st|mt|tr" << " matches mt? " << std::regex_match("mt", r) << std::endl;
    std::cerr << "st|mt|tr" << " matches spruce? " << std::regex_match("spruce", r) << std::endl;

    return 0;
}

Different Regular Expression Grammars

There are several different grammars for regular expressions. For this reason, C++ includes support for several of these grammars: ECMAScript, basic, extended, awk, grep, and egrep. If you already know any of these regular expression grammars, you can use it straight away in C++ by telling the regular expression library to use that specific syntax ( syntax_option_type ). The default grammar in C++ is ECMAScript whose syntax is explained in detail in the following section. It is also the most powerful grammar, so it's recommended to use ECMAScript instead of one of the other more limited grammars. Explaining the other regular expression grammars falls outside the scope of this section.

ECMAScript Syntax

The regular expression grammar used by C++ is the ECMA-262 (ECMAScript 3) grammar with modifications marked with (C++ only) below.

The modified regular expression grammar is mostly the ECMAScript RegExp grammar with a POSIX-type expansion on locales under ClassAtom. Some clarifications on equality checks and number parsing are made.

The normative references in the standard specify ECMAScript 3.

See the MDN Guide on JavaScript RegExp for an overview on the dialect features.

ECMAScript syntax recognizes the following special characters:

^ $ \ . * + ? ( ) [ ] { } |

If you need to match one of these special characters, you need to escape it using the \ character. For example:

\[ or \. or \* or \\

C++-Only Differences

basic Syntax

extended Syntax

awk Syntax

grep Syntax

egrep Syntax

Using Raw Strings

Use raw string literals in regular expressions.

The regular expression for the text C++ is fairly unwieldy: C\\+\\+. You have to use two backslashes for each + sign. First, the + sign is a special character in a regular expression. Second, the backslash is a special character in a string literal. Therefore one backslash escapes the + sign; the other backslash escapes the backslash. By using a raw string literal, the second backslash is no longer necessary because the backslash is not interpreted in the string.

#include <regex>

//...

std::string regExpr("C\\+\\+");
std::string regExprRaw(R"(C\+\+)");

Procedure for Applying Regular Expressions

Define the regular expression object

std::string text="C++ or c++.";
std::string regExpr(R"(C\+\+)");
std::regex rgx(regExpr);

Store the result of the search

std::smatch result;
std::regex_search(text, result, rgx);

Process the result

std::cout << result[0] << '\n';

Text Types

The text type determines the character type of the regular expression and the type of the search result.

The table below shows the four different combinations.

Text type	Regular expression type	Result type
const char*	std::regex	std::cmatch
std::string	std::regex	std::smatch
const wchar_t*	std::wregex	std::wcmatch
std::wstring	std::wregex	std::wsmatch

Regular Expression Objects

Regular expression objects are instances of the class template template <class charT, class traits = regex_traits<charT>> class basic_regex, parametrized by their character type and traits class. The traits class defines how the properties of the regular expression grammar are interpreted. There are two type synonyms in C++:

typedef basic_regex<char> regex;
typedef basic_regex<wchar_t> wregex;

You can further customize a regular expression object by specifying the grammar used or adapting the syntax. As mentioned, besides the default ECMAScript grammar, C++ supports the basic, extended, awk, grep, and egrep grammars.

A regular expression qualified by the std::regex_constants::icase flag is case insensitive. If you want to adapt the syntax in this way, you have to specify the grammar explicitly.

// regexGrammar.cpp
...
#include <regex>

...

using std::regex_constants::ECMAScript;
using std::regex_constants::icase;

std::string theQuestion="C++ or c++, that's the question.";
std::string regExprStr(R"(c\+\+)");

std::regex rgx(regExprStr);
std::smatch smatch;

if (std::regex_search(theQuestion, smatch, rgx)){
  std::cout << "case sensitive: " << smatch[0] << '\n';   // case sensitive: c++
}
std::regex rgxIn(regExprStr, ECMAScript|icase);
if (std::regex_search(theQuestion, smatch, rgxIn)){
  std::cout << "case insensitive: " << smatch[0] << '\n'; // case insensitive: C++
}

If you use the case-sensitive regular expression rgx, the result of the search in the text theQuestion is c++. That's not the case if your case-insensitive regular expression rgxIn is applied. Now you get the match string C++.

The Search Result `match_results`

The object of type std::match_results is the result of a std::regex_match or std::regex_search.

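std::match_results is a sequence container of std::sub_match objects. After a successful match it holds at least one element: element 0 is the entire match, and the following elements are the capture groups. Each std::sub_match object denotes a sequence of characters.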

C++ has four typedef's for std::match_results:

typedef match_results<const char*> cmatch;
typedef match_results<const wchar_t*> wcmatch;
typedef match_results<string::const_iterator> smatch;
typedef match_results<wstring::const_iterator> wsmatch;

The search result std::smatch has a powerful interface.

Member Function	Description
`smatch.size()`	Returns the number of sub-matches: the entire match plus the capture groups (0 if the match failed).
`smatch.empty()`	Returns true if the search result is empty, that is, if the match failed.
`smatch[i]`	Returns the ith capture group (index 0 is the entire match).
`smatch.length(i)`	Returns the length of the ith capture group.
`smatch.position(i)`	Returns the position of the ith capture group.
`smatch.str(i)`	Returns the ith capture group as a string.
`smatch.prefix() and smatch.suffix()`	Returns the text before and after the entire match.
`smatch.begin() and smatch.end()`	Returns the begin and end iterator for the capture groups.
`smatch.format(...)`	Formats std::smatch objects for the output.

The following program shows the entire match and the first three capture groups for different regular expressions.

// captureGroups.cpp
...
#include <regex>
...
using namespace std;

void showCaptureGroups(const string& regEx, const string& text){
  regex rgx(regEx);
  smatch smatch;
  if (regex_search(text, smatch, rgx)) {
    cout << regEx << " | " << text << " | " << smatch[0] << " " << smatch[1]
    << " "<< smatch[2] << " " << smatch[3] << endl;
  }
}

showCaptureGroups("abc+", "abccccc");
showCaptureGroups("(a+)(b+)", "aaabccc");
showCaptureGroups("((a+)(b+))", "aaabccc");
showCaptureGroups("(ab)(abc)+", "ababcabc");

`std::sub_match`

The capture groups are of type std::sub_match. As with std::match_results, C++ defines the following four type synonyms.

typedef sub_match<const char*> csub_match;
typedef sub_match<const wchar_t*> wcsub_match;
typedef sub_match<string::const_iterator> ssub_match;
typedef sub_match<wstring::const_iterator> wssub_match;

You can further analyze the capture group cap.

Member Function	Description
`cap.matched`	Indicates if this match was successful (data member).
`cap.first` and `cap.second`	The begin and end iterators of the character sequence (data members).
`cap.length()`	Returns the length of the capture group.
`cap.str()`	Returns the capture group as a string.
`cap.compare(other)`	Compares the current capture group with the `other` capture group.

Here is a code snippet showing the interplay between the search result std::match_results and its capture groups std::sub_match's:

// subMatch.cpp
...
#include <regex>
...
using std::cout;

std::string privateAddress="192.168.178.21";
std::string regEx(R"((\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}))");
std::regex rgx(regEx);
std::smatch smatch;

if (std::regex_match(privateAddress, smatch, rgx)) {

  for (auto cap: smatch) {
    cout << "capture group: " << cap << '\n';
    if (cap.matched) {
      std::for_each(cap.first,
                    cap.second,
                    [](int v) {
        cout << std::hex << v << " ";
                              });
      cout << '\n';
    }
  } // for

}

...

capture group: 192.168.178.21
31 39 32 2e 31 36 38 2e 31 37 38 2e 32 31

capture group: 192
31 39 32

capture group: 168
31 36 38

capture group: 178
31 37 38

capture group: 21
32 31

The regular expression regEx stands for an IPv4 address. regEx extracts the address's components using capture groups. Finally, the capture groups and the characters in ASCII are displayed in hexadecimal values.

Matching

std::regex_match determines whether the entire text matches a text pattern. You can further analyze the result, which is of type std::match_results. To find a pattern inside a longer text, use a different global function: std::regex_search.

An Example

The code snippet below shows three simple applications of std::regex_match, each returning only a boolean: on a C string, on a C++ string, and on a range. The same three variants are also available with an additional std::match_results parameter.

// match.cpp
...
#include <regex>
...
std::string numberRegEx(R"([-+]?([0-9]*\.[0-9]+|[0-9]+))");
std::regex rgx(numberRegEx);
const char* numChar{"2011"};

if (std::regex_match(numChar, rgx)) {
  std::cout << numChar << " is a number." << '\n';
}
// 2011 is a number.

const std::string numStr{"3.14159265359"};
if (std::regex_match(numStr, rgx)){
  std::cout << numStr << " is a number." << '\n';
}
// 3.14159265359 is a number.

const std::vector<char> numVec{{'-', '2', '.', '7', '1', '8', '2',
'8', '1', '8', '2', '8'}};
if (std::regex_match(numVec.begin(), numVec.end(), rgx)) {
  for (auto c: numVec) { std::cout << c ;};
  std::cout << " is a number." << '\n';
} // if
// -2.718281828 is a number.

`std::regex_match` (C++11)

Its overloads take as their first argument(s):

a begin and an end iterator, or
a pointer to const CHAR (const CHAR*), or
a constant reference to a string (const std::basic_string<CHAR>&)

These one or two parameters may be followed by a non-constant reference to a std::match_results object.

The next parameter is mandatory: a reference to a std::basic_regex object.

Last is an optional flags parameter.

`regex_match` only considers full matches

Because regex_match only considers full matches, the same regex may give different matches between std::regex_match and std::regex_search:

std::regex re("Get|GetValue");
std::cmatch m;
std::regex_search("GetValue", m, re);  // returns true, and m[0] contains "Get"
std::regex_match ("GetValue", m, re);  // returns true, and m[0] contains "GetValue"
std::regex_search("GetValues", m, re); // returns true, and m[0] contains "Get"
std::regex_match ("GetValues", m, re); // returns false

`std::regex_constants::match_flag_type` Flags

Their type is implementation-defined.

Name	Explanation
`match_not_bol`	The first character in [first, last) will be treated as if it is not at the beginning of a line (i.e. ^ will not match [first, first)).
`match_not_eol`	The last character in [first, last) will be treated as if it is not at the end of a line (i.e. $ will not match [last, last)).
`match_not_bow`	\b will not match [first, first).
`match_not_eow`	\b will not match [last, last).
`match_any`	If more than one match is possible, then any match is an acceptable result.
`match_not_null`	Do not match empty sequences.
`match_continuous`	Only match a sub-sequence that begins at first.
`match_prev_avail`	--first is a valid iterator position. When set, causes match_not_bol and match_not_bow to be ignored.
`format_default`	Use ECMAScript rules to construct strings in std::regex_replace.
`format_sed`	Use POSIX sed utility rules in std::regex_replace.
`format_no_copy`	Do not copy un-matched strings to the output in std::regex_replace.
`format_first_only`	Only replace the first match in std::regex_replace.

All constants, except for match_default and format_default, are bitmask elements. The match_default and format_default constants are empty bitmasks.

Searching

std::regex_search<CHAR> checks if the text contains a text pattern. You can use the function with and without a std::match_results object and apply it to a C string, a C++ string, or a range.

An Example

The example below shows how to use std::regex_search with texts of type const char*, std::string, const wchar_t*, and std::wstring.

// search.cpp
...
#include <regex>
...

// regular expression holder for time
std::regex crgx("([01]?[0-9]|2[0-3]):[0-5][0-9]");

// const char*
std::cmatch cmatch;

const char* ctime{"Now it is 23:10." };
if (std::regex_search(ctime, cmatch, crgx)) {
  std::cout << ctime << '\n';
  std::cout << "Time: " << cmatch[0] << '\n'; // Time: 23:10
}

// std::string
std::smatch smatch;
std::string stime{"Now it is 23:25." };
if (std::regex_search(stime, smatch, crgx)) {
  std::cout << stime << '\n';
  std::cout << "Time: " << smatch[0] << '\n'; // Time: 23:25
}

// regular expression holder for time
std::wregex wrgx(L"([01]?[0-9]|2[0-3]):[0-5][0-9]");

// const wchar_t*
std::wcmatch wcmatch;

const wchar_t* wctime{L"Now it is 23:47." };
if (std::regex_search(wctime, wcmatch, wrgx)) {
  std::wcout << wctime << '\n';
  std::wcout << "Time: " << wcmatch[0] << '\n'; // Time: 23:47
}

// std::wstring
std::wsmatch wsmatch;

std::wstring wstime{L"Now it is 00:03." };
if (std::regex_search(wstime, wsmatch, wrgx)) {
  std::wcout << wstime << '\n';
  std::wcout << "Time: " << wsmatch[0] << '\n'; // Time: 00:03
}

`std::regex_search` (C++11)

Determines if there is a match between the regular expression e and some subsequence in the target character sequence. The detailed match result is stored in input-output parameter std::match_results<> m (if present).

The return type is bool.

The main difference between std::regex_match and std::regex_search is...

Replacing

std::regex_replace replaces sequences in a text that match a text pattern. In its simple form, std::regex_replace(text, regex, replString), it returns its result as a string. By default, the function replaces every occurrence of regex in text with replString.

// replace.cpp
...
#include <regex>
...
using namespace std;

string future{"Future"};
string unofficialName{
  "The unofficial name of the new C++ standard is C++0x."};

regex rgxCpp{R"(C\+\+0x)"};
string newCppName{"C++11"};
string newName{regex_replace(unofficialName, rgxCpp, newCppName)};

regex rgxOff{"unofficial"};
string makeOfficial{"official"};
string officialName{regex_replace(newName, rgxOff, makeOfficial)};

cout << officialName << endl;
            // The official name of the new C++ standard is C++11.

In addition to the simple version, C++ has a version of std::regex_replace working on ranges. It enables you to push the modified string directly into another string:

// text, rgx, and replString stand for the input text, the regular
// expression object, and the replacement string
std::string str2;
std::regex_replace(std::back_inserter(str2),
                   text.begin(), text.end(),
                   rgx, replString);

All variants of std::regex_replace have an additional optional parameter. If you set the parameter to std::regex_constants::format_no_copy, you will get the part of the text matching the regular expression. The unmatched text is not copied. If you set the parameter to std::regex_constants::format_first_only, then std::regex_replace will only be applied once.

Formatting

std::regex_replace and std::match_results::format, in combination with capture groups, enable you to format text. You can use a format string together with placeholders to insert the captured values.

Here are both possibilities, first with std::regex_replace and then with std::match_results::format:

// format.cpp
...
#include <regex>
...
std::string future{"Future"};
const std::string unofficial{"unofficial, C++0x"};
const std::string official{"official, C++11"};

// the separator ", " is matched outside of the capture groups
std::regex regValues{"(.*), (.*)"};
std::string standardText{"The $1 name of the new C++ standard is $2."};
std::string textNow = std::regex_replace(unofficial, regValues, standardText);
std::cout << textNow << '\n';
  // The unofficial name of the new C++ standard is C++0x.

std::smatch smatch;
if (std::regex_match(official, smatch, regValues)) {
  std::cout << smatch.str() << '\n'; // official, C++11
  std::string textFuture = smatch.format(standardText);
  std::cout << textFuture << '\n';
} // The official name of the new C++ standard is C++11.

In the function call std::regex_replace(unofficial, regValues, standardText), the text matching the first and second capture group of the regular expression regValues is extracted from the string unofficial. The placeholders $1 and $2 in the text standardText are then replaced by the extracted values. The strategy of smatch.format(standardText) is similar, but there is a difference:

The creation of the search results smatch is separated from their usage when formatting the string.

In addition to capture groups, C++ supports additional format escape sequences. You can use them in format strings:

Format escape sequence	Description
`$&`	Returns the total match (0th capture group).
`$$`	Returns $.
$` (backtick)	Returns the text before the total match.
`$'` (apostrophe)	Returns the text after the total match.
`$i`	Returns the ith capture group.

Repeated Search

It's pretty convenient to iterate through the matched texts with std::regex_iterator and std::regex_token_iterator. std::regex_iterator gives you each match together with its capture groups. std::regex_token_iterator supports more: you can address individual capture groups of each match by index.

Using the index -1 gives you access to the text between the matches.

`std::regex_iterator`

C++ defines the following four type synonyms for std::regex_iterator:

typedef regex_iterator<const char*> cregex_iterator;
typedef regex_iterator<const wchar_t*> wcregex_iterator;
typedef regex_iterator<std::string::const_iterator> sregex_iterator;
typedef regex_iterator<std::wstring::const_iterator> wsregex_iterator;

You may use std::regex_iterator to count the occurrences of the words in a text:

// regexIterator.cpp
...
#include <regex>
#include <unordered_map>
...
using std::cout;

std::string text{"That's a (to me) amazingly frequent question. It may be the most freque\
ntly asked question. Surprisingly, C++11 feels like a new language: The pieces just fit t\
ogether better than they used to, and I find a higher-level style of programming more nat\
ural than before and as efficient as ever." };

std::regex wordReg{R"(\w+)"};
      std::sregex_iterator wordItBegin(text.begin(), text.end(), wordReg);
const std::sregex_iterator wordItEnd;
std::unordered_map<std::string, std::size_t> allWords;
for (; wordItBegin != wordItEnd; ++wordItBegin) {
  ++allWords[wordItBegin->str()];
}
for (auto wordIt: allWords)
  cout << "(" << wordIt.first << ":"
       << wordIt.second << ")";
// (as:2)(of:1)(level:1)(find:1)(ever:1)(and:2)(natural:1)

A word consists of at least one word character (\w+). This regular expression is used to define the begin iterator wordItBegin; the end iterator wordItEnd is then created by the default constructor.

The iteration through the matches happens in the for loop. Each word increments its counter: ++allWords[wordItBegin->str()]. If a word is not yet in allWords, an entry is created for it first, so its counter equals 1 after the increment.

`std::regex_token_iterator`

C++ defines the following four type synonyms for std::regex_token_iterator:

typedef regex_token_iterator<const char*> cregex_token_iterator;
typedef regex_token_iterator<const wchar_t*> wcregex_token_iterator;
typedef regex_token_iterator<std::string::const_iterator> sregex_token_iterator;
typedef regex_token_iterator<std::wstring::const_iterator> wsregex_token_iterator;

std::regex_token_iterator enables you to use indexes to explicitly specify which capture groups you are interested in. If you don't specify an index, you get the entire match (index 0) for each match; you can also request specific capture groups using their respective indexes.

The index -1 is special: you can use -1 to address the text between the matches.

// tokenIterator.cpp
...
using namespace std;

std::string text{"Pete Becker,The C++ Standard Library Extensions,2006:"
"Nicolai Josuttis,The C++ Standard Library,1999:"
"Andrei Alexandrescu,Modern C++ Design,2001"};

regex regBook(R"((\w+)\s(\w+),([\w\s\+]*),(\d{4}))");
sregex_token_iterator bookItBegin(text.begin(), text.end(), regBook);
const sregex_token_iterator bookItEnd;
while (bookItBegin != bookItEnd){
  cout << *bookItBegin++ << endl;
}
// Pete Becker,The C++ Standard Library Extensions,2006
// Nicolai Josuttis,The C++ Standard Library,1999
// Andrei Alexandrescu,Modern C++ Design,2001

// request only capture groups 2 (last name) and 4 (year)
sregex_token_iterator bookItNameIssueBegin(text.begin(),
                                           text.end(),
                                           regBook, {2, 4});
const sregex_token_iterator bookItNameIssueEnd;

Removing Greediness

To make regular expression repetitions non-greedy, a ? can be added behind the repeat as in *? , +? , ?? , and {...}? . A non-greedy repetition repeats its pattern as few times as possible while still matching the remainder of the regular expression.
