C++ Regular Expression Iterators
std::regex_iterator and std::regex_token_iterator make it convenient to iterate through the matched texts. std::regex_iterator gives you each match together with its capture groups. std::regex_token_iterator goes further: you can select individual capture groups of each match by index.
Using the index -1 gives you access to the text between the matches.
std::regex_iterator
C++ defines the following four type synonyms for std::regex_iterator:
typedef regex_iterator<const char*> cregex_iterator;
typedef regex_iterator<const wchar_t*> wcregex_iterator;
typedef regex_iterator<std::string::const_iterator> sregex_iterator;
typedef regex_iterator<std::wstring::const_iterator> wsregex_iterator;
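The synonym you pick follows from the kind of range you iterate over. As a minimal sketch, assuming a null-terminated C string as input, std::cregex_iterator works directly on raw pointers. The helper name findNumbers and the pattern \d+ are made up for this illustration:

```cpp
#include <cstring>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of the standard library):
// collects every run of digits found in a C string.
std::vector<std::string> findNumbers(const char* text) {
    // The regex must outlive the iterators that refer to it.
    static const std::regex numberReg{R"(\d+)"};
    std::vector<std::string> numbers;
    // cregex_iterator is regex_iterator<const char*>, so two pointers mark the range.
    std::cregex_iterator it(text, text + std::strlen(text), numberReg);
    const std::cregex_iterator end;  // default-constructed: end of sequence
    for (; it != end; ++it) {
        numbers.push_back(it->str());
    }
    return numbers;
}
```

Called as findNumbers("a1 b22 c333"), this sketch should return the three strings "1", "22", and "333". For a std::string you would use std::sregex_iterator in the same way.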
You may use std::regex_iterator to count the occurrences of the words in a text:
// regexIterator.cpp
...
#include <regex>
#include <unordered_map>
...
using std::cout;
std::string text{"That's a (to me) amazingly frequent question. It may be the most freque\
ntly asked question. Surprisingly, C++11 feels like a new language: The pieces just fit t\
ogether better than they used to, and I find a higher-level style of programming more nat\
ural than before and as efficient as ever." };
std::regex wordReg{R"(\w+)"};
std::sregex_iterator wordItBegin(text.begin(), text.end(), wordReg);
const std::sregex_iterator wordItEnd;
std::unordered_map<std::string, std::size_t> allWords;
for (; wordItBegin != wordItEnd; ++wordItBegin) {
++allWords[wordItBegin->str()];
}
for (auto wordIt: allWords)
cout << "(" << wordIt.first << ":"
<< wordIt.second << ")";
// (as:2)(of:1)(level:1)(find:1)(ever:1)(and:2)(natural:1)
A word consists of at least one word character (\w+). This regular expression is used to define the begin iterator wordItBegin; the end iterator wordItEnd is created with the default constructor.
The iteration through the matches happens in the for loop. Each word increments its counter: ++allWords[wordItBegin->str()]. If the word is not yet in allWords, operator[] first inserts it with a counter of 0, and the increment then sets the counter to 1.
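The counting idiom can be packaged as a small function. The following is a sketch under stated assumptions: the helper name countWords is hypothetical, and it uses std::map instead of std::unordered_map only so that the result comes out in sorted order:

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: counts how often each word occurs in text.
std::map<std::string, std::size_t> countWords(const std::string& text) {
    static const std::regex wordReg{R"(\w+)"};
    std::map<std::string, std::size_t> counts;
    const std::sregex_iterator end;  // default-constructed: end of sequence
    for (std::sregex_iterator it(text.begin(), text.end(), wordReg); it != end; ++it) {
        // operator[] inserts a missing key with the value 0; ++ then makes it 1.
        ++counts[it->str()];
    }
    return counts;
}
```

For the input "as good as new", the returned map should hold three entries, with "as" counted twice.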
Another example, from the CPlusPlus site:
#include <iostream>
#include <string>
#include <regex>
int main ()
{
std::string s ("this subject has a submarine as a subsequence");
std::regex e ("\\b(sub)([^ ]*)"); // matches words beginning with "sub"
std::regex_iterator<std::string::iterator> rit ( s.begin(), s.end(), e );
std::regex_iterator<std::string::iterator> rend;
while (rit!=rend) {
std::cout << rit->str() << std::endl;
++rit;
}
return 0;
}
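The example above prints only the whole match, but each dereferenced std::regex_iterator is a std::match_results object, so the capture groups are available as well. A minimal sketch with the same regular expression follows; the helper name splitSubWords is an assumption made for this illustration:

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns (group 1, group 2) for every word starting with "sub".
std::vector<std::pair<std::string, std::string>> splitSubWords(const std::string& s) {
    static const std::regex e{R"(\b(sub)([^ ]*))"};
    std::vector<std::pair<std::string, std::string>> parts;
    const std::sregex_iterator rend;
    for (std::sregex_iterator rit(s.begin(), s.end(), e); rit != rend; ++rit) {
        const std::smatch& m = *rit;  // m[0] is the whole match
        // m[1] is the prefix "sub", m[2] is the rest of the word.
        parts.emplace_back(m[1].str(), m[2].str());
    }
    return parts;
}
```

For the sentence used above, this should yield the pairs ("sub", "ject"), ("sub", "marine"), and ("sub", "sequence").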
[...]
std::regex_token_iterator
C++ defines the following four type synonyms for std::regex_token_iterator:
typedef regex_token_iterator<const char*> cregex_token_iterator;
typedef regex_token_iterator<const wchar_t*> wcregex_token_iterator;
typedef regex_token_iterator<std::string::const_iterator> sregex_token_iterator;
typedef regex_token_iterator<std::wstring::const_iterator> wsregex_token_iterator;
std::regex_token_iterator lets you specify by index which capture groups you are interested in. If you don't specify an index, you get the whole match (index 0). You can also request one or several capture groups by passing their indexes.
The index -1 is special: it addresses the text between the matches.
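A common use of the index -1 is splitting a string at a separator. The following is only a sketch: the helper name splitFields and the separator pattern (a comma followed by optional whitespace) are assumptions chosen for this illustration:

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits text at every comma, dropping whitespace after it.
std::vector<std::string> splitFields(const std::string& text) {
    static const std::regex sepReg{R"(,\s*)"};
    std::vector<std::string> fields;
    // Index -1 selects the text between the matches, not the matches themselves.
    std::sregex_token_iterator it(text.begin(), text.end(), sepReg, -1);
    const std::sregex_token_iterator end;  // default-constructed: end of sequence
    for (; it != end; ++it) {
        fields.push_back(it->str());
    }
    return fields;
}
```

Applied to "Pete Becker, The C++ Standard Library Extensions, 2006", this should produce the three fields "Pete Becker", "The C++ Standard Library Extensions", and "2006".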
// tokenIterator.cpp
...
using namespace std;
std::string text{"Pete Becker,The C++ Standard Library Extensions,2006:"
"Nicolai Josuttis,The C++ Standard Library,1999:"
"Andrei Alexandrescu,Modern C++ Design,2001"};
regex regBook(R"((\w+)\s(\w+),([\w\s\+]*),(\d{4}))");
sregex_token_iterator bookItBegin(text.begin(), text.end(), regBook);
const sregex_token_iterator bookItEnd;
while (bookItBegin != bookItEnd){
cout << *bookItBegin++ << endl;
}
// Pete Becker,The C++ Standard Library Extensions,2006
// Nicolai Josuttis,The C++ Standard Library,1999
// Andrei Alexandrescu,Modern C++ Design,2001
sregex_token_iterator bookItNameIssueBegin(text.begin(),
text.end(),
regBook, {{2,4}});
const sregex_token_iterator bookItNameIssueEnd;