Regular Expressions in the STL

Regular expressions, defined in the <regex> header, are a powerful feature of the Standard Library. They are a special mini-language for string processing. They might seem complicated at first, but once you get to know them, they make working with strings easier.

Regular Expression Functions Provided

The main functions provided are std::regex_match, std::regex_search, and std::regex_replace.

A few simple operations are shown in the following example:

#include <regex>
#include <iostream>

int main (int argc, const char * argv[]) {
    std::regex r("st|mt|tr");
    std::cerr << "st|mt|tr" << " matches st? " << std::regex_match("st", r) << std::endl;
    std::cerr << "st|mt|tr" << " matches mt? " << std::regex_match("mt", r) << std::endl;
    std::cerr << "st|mt|tr" << " matches spruce? " << std::regex_match("spruce", r) << std::endl;

    return 0;
}

Different Regular Expression Grammars

There are several different grammars for regular expressions, and C++ supports six of them: ECMAScript, basic, extended, awk, grep, and egrep. If you already know one of these grammars, you can use it straight away in C++ by selecting it with a flag of type std::regex_constants::syntax_option_type when you construct the regular expression. The default grammar in C++ is ECMAScript, whose syntax is explained in the following section. It is also the most powerful grammar, so it is recommended over the other, more limited grammars. Explaining the other regular expression grammars falls outside the scope of this section.

ECMAScript Syntax

The ECMAScript 3 regular expression grammar in C++ is the ECMA-262 grammar with some C++-only modifications.

The modified regular expression grammar is mostly the ECMAScript RegExp grammar with a POSIX-type expansion on locales under ClassAtom. Some clarifications on equality checks and number parsing are made.

The normative references in the standard specify ECMAScript 3.

See the MDN Guide on JavaScript RegExp for an overview of the dialect features.


ECMAScript syntax recognizes the following special characters:

^ $ \ . * + ? ( ) [ ] { } |

If you need to match one of these special characters, you need to escape it using the \ character. For example:

\[ or \. or \* or \\

C++-Only Differences

basic Syntax

extended Syntax

awk Syntax

grep Syntax

egrep Syntax
Using Raw Strings

Use raw string literals in regular expressions.

The regular expression for the text C++ is fairly unwieldy: C\\+\\+. You have to use two backslashes for each + sign. First, the + sign is a special character in a regular expression. Second, the backslash is a special character in a string literal. Therefore, one backslash escapes the + sign; the other backslash escapes the backslash. With a raw string literal, the second backslash is no longer necessary because the backslash is not interpreted in the string.

#include <regex>

//...

std::string regExpr("C\\+\\+");
std::string regExprRaw(R"(C\+\+)");

Procedure for Applying Regular Expressions

1. Define the regular expression object:

std::string text="C++ or c++.";
std::string regExpr(R"(C\+\+)");
std::regex rgx(regExpr);

2. Store the result of the search:

std::smatch result;
std::regex_search(text, result, rgx);

3. Process the result:

std::cout << result[0] << '\n';

Text Types

The text type determines the character type of the regular expression and the type of the search result.

The table below shows the four different combinations.

Text type         Regular expression type   Result type
const char*       std::regex                std::cmatch
std::string       std::regex                std::smatch
const wchar_t*    std::wregex               std::wcmatch
std::wstring      std::wregex               std::wsmatch

Regular Expression Objects

Regular expression objects are instances of the class template basic_regex, declared as template <class charT, class traits = regex_traits<charT>> class basic_regex, and are parametrized by their character type and traits class. The traits class defines how the properties of the regular grammar are interpreted. There are two type synonyms in C++:

typedef basic_regex<char> regex;
typedef basic_regex<wchar_t> wregex;

You can further customize a regular expression object by specifying the grammar used or adapting the syntax. As mentioned, C++ supports the ECMAScript, basic, extended, awk, grep, and egrep grammars.

A regular expression qualified by the std::regex_constants::icase flag is case insensitive. If you want to adapt the syntax, you have to specify the grammar explicitly.

// regexGrammar.cpp
...
#include <regex>

...

using std::regex_constants::ECMAScript;
using std::regex_constants::icase;

std::string theQuestion="C++ or c++, that's the question.";
std::string regExprStr(R"(c\+\+)");

std::regex rgx(regExprStr);
std::smatch smatch;

if (std::regex_search(theQuestion, smatch, rgx)){
  std::cout << "case sensitive: " << smatch[0] << '\n';
}
std::regex rgxIn(regExprStr, ECMAScript|icase);
if (std::regex_search(theQuestion, smatch, rgxIn)){
  std::cout << "case insensitive: " << smatch[0] << '\n';
}

If you use the case-sensitive regular expression rgx, the search in the text theQuestion finds c++. With the case-insensitive regular expression rgxIn, the search instead stops at the first occurrence and yields the match string C++.

The Search Result std::match_results

An object of type std::match_results is the result of a std::regex_match or std::regex_search.

std::match_results is a sequence container of std::sub_match objects. After a successful match, it holds at least one element: the 0th element is the total match, and the following elements are the capture groups. Each std::sub_match object denotes a sequence of characters.

C++ has four type synonyms for std::match_results:

typedef match_results<const char*> cmatch;
typedef match_results<const wchar_t*> wcmatch;
typedef match_results<string::const_iterator> smatch;
typedef match_results<wstring::const_iterator> wsmatch;

The search result std::smatch has a powerful interface.

Member Function                       Description
smatch.size()                         Returns the number of capture groups, including the total match (0th capture group).
smatch.empty()                        Returns whether the search result is empty, that is, whether the match failed.
smatch[i]                             Returns the ith capture group.
smatch.length(i)                      Returns the length of the ith capture group.
smatch.position(i)                    Returns the position of the ith capture group.
smatch.str(i)                         Returns the ith capture group as a string.
smatch.prefix() and smatch.suffix()   Return the text before and after the total match.
smatch.begin() and smatch.end()       Return the begin and end iterator for the capture groups.
smatch.format(...)                    Formats std::smatch objects for the output.

The following program outputs the first four capture groups (the total match and up to three groups) for different regular expressions.

// captureGroups.cpp
...
#include <regex>
...
using namespace std;

void showCaptureGroups(const string& regEx, const string& text){
  regex rgx(regEx);
  smatch smatch;
  if (regex_search(text, smatch, rgx)) {
    cout << regEx << " " << text << " " << smatch[0] << " " << smatch[1]
         << " " << smatch[2] << " " << smatch[3] << endl;
  }
}

showCaptureGroups("abc+", "abccccc");
showCaptureGroups("(a+)(b+)", "aaabccc");
showCaptureGroups("((a+)(b+))", "aaabccc");
showCaptureGroups("(ab)(abc)+", "ababcabc");

std::sub_match

The capture groups are of type std::sub_match. As with std::match_results, C++ defines the following four type synonyms.

typedef sub_match<const char*> csub_match;
typedef sub_match<const wchar_t*> wcsub_match;
typedef sub_match<string::const_iterator> ssub_match;
typedef sub_match<wstring::const_iterator> wssub_match;

You can further analyze the capture group cap.

Member                      Description
cap.matched                 Indicates if this match was successful.
cap.first and cap.second    Hold the begin and end iterator of the character sequence.
cap.length()                Returns the length of the capture group.
cap.str()                   Returns the capture group as a string.
cap.compare(other)          Compares the current capture group with the other capture group.

Here is a code snippet showing the interplay between the search result std::match_results and its capture groups of type std::sub_match:

// subMatch.cpp
...
#include <algorithm>
#include <regex>
...
using std::cout;

std::string privateAddress="192.168.178.21";
std::string regEx(R"((\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}))");
std::regex rgx(regEx);
std::smatch smatch;

if (std::regex_match(privateAddress, smatch, rgx)) {
  for (auto cap: smatch) {
    cout << "capture group: " << cap << '\n';
    if (cap.matched) {
      // print each character of the capture group as a hexadecimal value
      std::for_each(cap.first, cap.second, [](int v) {
        cout << std::hex << v << " ";
      });
      cout << '\n';
    }
  } // for
}

...

capture group: 192.168.178.21
31 39 32 2e 31 36 38 2e 31 37 38 2e 32 31
capture group: 192
31 39 32
capture group: 168
31 36 38
capture group: 178
31 37 38
capture group: 21
32 31

The regular expression regEx stands for an IPv4 address. regEx extracts the address's components using capture groups. Finally, each capture group is displayed, followed by the ASCII codes of its characters in hexadecimal.

Matching

std::regex_match determines whether the entire text matches a text pattern. You can further analyze the result, which is of type std::match_results. To look for the pattern anywhere inside the text, use a different function: std::regex_search.

An Example

The code snippet below shows three simple applications of std::regex_match, each returning only a boolean: on a C string, on a C++ string, and on a range. Each of the three variants is also available with a std::match_results object.

// match.cpp
...
#include <regex>
...
std::string numberRegEx(R"([-+]?([0-9]*\.[0-9]+|[0-9]+))");
std::regex rgx(numberRegEx);
const char* numChar{"2011"};

if (std::regex_match(numChar, rgx)) {
  std::cout << numChar << " is a number." << '\n';
}
// 2011 is a number.

const std::string numStr{"3.14159265359"};
if (std::regex_match(numStr, rgx)){
  std::cout << numStr << " is a number." << '\n';
}
// 3.14159265359 is a number.

const std::vector<char> numVec{'-', '2', '.', '7', '1', '8', '2',
                               '8', '1', '8', '2', '8'};
if (std::regex_match(numVec.begin(), numVec.end(), rgx)) {
  for (auto c: numVec) { std::cout << c; }
  std::cout << " is a number." << '\n';
} // if
// -2.718281828 is a number.

std::regex_match (C++11)

Its overloads take as their first argument(s):

  • a begin and an end iterator, or
  • a pointer to const CHAR (const CHAR*), or
  • a constant reference to a string (const std::basic_string<CHAR>&)

These one or two parameters may be followed by a non-constant reference to a std::match_results object.

The next parameter is mandatory: a constant reference to a std::basic_regex object.

Last is an optional flags parameter.

std::regex_constants::match_flag_type Flags

Their type is implementation-defined.

Name                Explanation
match_not_bol       The first character in [first, last) will be treated as if it is not at the beginning of a line (i.e. ^ will not match [first, first)).
match_not_eol       The last character in [first, last) will be treated as if it is not at the end of a line (i.e. $ will not match [last, last)).
match_not_bow       \b will not match [first, first).
match_not_eow       \b will not match [last, last).
match_any           If more than one match is possible, then any match is an acceptable result.
match_not_null      Do not match empty sequences.
match_continuous    Only match a sub-sequence that begins at first.
match_prev_avail    --first is a valid iterator position. When set, causes match_not_bol and match_not_bow to be ignored.
format_default      Use ECMAScript rules to construct strings in std::regex_replace.
format_sed          Use POSIX sed utility rules in std::regex_replace.
format_no_copy      Do not copy un-matched strings to the output in std::regex_replace.
format_first_only   Only replace the first match in std::regex_replace.

All constants, except for match_default and format_default, are bitmask elements. The match_default and format_default constants are empty bitmasks.

Searching

std::regex_search checks whether the text contains a match for a text pattern. You can use the function with and without a std::match_results object and apply it to a C string, a C++ string, or a range.

An Example

The example below shows how to use std::regex_search with texts of type const char*, std::string, const wchar_t*, and std::wstring.

// search.cpp
...
#include <regex>
...

// regular expression holder for time
std::regex crgx("([01]?[0-9]|2[0-3]):[0-5][0-9]");

// const char*
std::cmatch cmatch;

const char* ctime{"Now it is 23:10." };
if (std::regex_search(ctime, cmatch, crgx)) {
  std::cout << ctime << '\n';
  std::cout << "Time: " << cmatch[0] << '\n'; // Time: 23:10
}

// std::string
std::smatch smatch;
std::string stime{"Now it is 23:25." };
if (std::regex_search(stime, smatch, crgx)) {
  std::cout << stime << '\n';
  std::cout << "Time: " << smatch[0] << '\n'; // Time: 23:25
}

// regular expression holder for time
std::wregex wrgx(L"([01]?[0-9]|2[0-3]):[0-5][0-9]");

// const wchar_t*
std::wcmatch wcmatch;

const wchar_t* wctime{L"Now it is 23:47." };
if (std::regex_search(wctime, wcmatch, wrgx)) {
  std::wcout << wctime << '\n';
  std::wcout << "Time: " << wcmatch[0] << '\n'; // Time: 23:47
}

// std::wstring
std::wsmatch wsmatch;

std::wstring wstime{L"Now it is 00:03." };
if (std::regex_search(wstime, wsmatch, wrgx)) {
  std::wcout << wstime << '\n';
  std::wcout << "Time: " << wsmatch[0] << '\n'; // Time: 00:03
}

Replacing

std::regex_replace replaces sequences in a text that match a text pattern. In its simple form, std::regex_replace(text, regex, replString), it returns its result as a string. By default, the function replaces every occurrence of regex in text with replString.

// replace.cpp
...
#include <regex>
...
using namespace std;

string future{"Future"};
string unofficialName{
  "The unofficial name of the new C++ standard is C++0x."};

regex rgxCpp{R"(C\+\+0x)"};
string newCppName{"C++11"};
string newName{regex_replace(unofficialName, rgxCpp, newCppName)};

regex rgxOff{"unofficial"};
string makeOfficial{"official"};
string officialName{regex_replace(newName, rgxOff, makeOfficial)};

cout << officialName << endl;
            // The official name of the new C++ standard is C++11.

In addition to the simple version, C++ has a version of std::regex_replace that works on ranges. It lets you write the modified text directly into another string through an output iterator:

// text, regex, and replString are the arguments of the simple form above
std::string str2;
std::regex_replace(std::back_inserter(str2),
                   text.begin(), text.end(),
                   regex, replString);

All variants of std::regex_replace have an additional optional parameter. If you set the parameter to std::regex_constants::format_no_copy, the result contains only the replacements for the parts of the text matching the regular expression; the unmatched text is not copied. If you set the parameter to std::regex_constants::format_first_only, std::regex_replace replaces only the first match.

Formatting

std::regex_replace and std::match_results::format in combination with capture groups enable you to format text. You can use a format string together with placeholders to insert the values.

Here are both possibilities, first with std::regex_replace and then with std::match_results::format:

// format.cpp
...
#include <regex>
...
std::string future{"Future"};
const std::string unofficial{"unofficial, C++0x"};
const std::string official{"official, C++11"};

std::regex regValues{"(.*), (.*)"};
std::string standardText{"The $1 name of the new C++ standard is $2."};
std::string textNow = std::regex_replace(unofficial, regValues, standardText);
std::cout << textNow << '\n';
  // The unofficial name of the new C++ standard is C++0x.

std::smatch smatch;
if (std::regex_match(official, smatch, regValues)) {
  std::cout << smatch.str() << '\n'; // official, C++11
  std::string textFuture = smatch.format(standardText);
  std::cout << textFuture << '\n';
} // The official name of the new C++ standard is C++11.

In the function call std::regex_replace(unofficial, regValues, standardText), the text matching the first and second capture groups of the regular expression regValues is extracted from the string unofficial. The placeholders $1 and $2 in the text standardText are then replaced by the extracted values. The strategy of smatch.format(standardText) is similar, but there is a difference: the creation of the search result smatch is separated from its use when formatting the string.

In addition to capture groups, C++ supports additional format escape sequences. You can use them in format strings:

Format escape sequence   Description
$&                       Returns the total match (0th capture group).
$$                       Returns $.
$` (backtick)            Returns the text before the total match.
$' (apostrophe)          Returns the text after the total match.
$i                       Returns the ith capture group.

Removing Greediness

To make regular expression repetitions non-greedy, a ? can be added after the repeat, as in *?, +?, ??, and {...}?. A non-greedy repetition repeats its pattern as few times as possible while still matching the remainder of the regular expression.