Handling Text in C++: Tokenizing/Analysing/Lexing...
Tokenizing
Tokenizing a string programmatically means splitting a string with respect to some delimiter(s). There are several ways to tokenize a string.
Using std::getline and stringstream
We shall be relying on std::getline(ISTREAM, STRING, DELIMITER) to read characters from the stream into the string until the delimiter is encountered.
A stringstream associates a string object with a stream, allowing you to read from the string as if it were a stream.
Below is a C++ implementation:
// Tokenizing a string using stringstream
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

int main() {
    string line = "GeeksForGeeks is a must try";

    // Declare a vector of string to save tokens:
    vector<string> tokens;

    // Declare stringstream 'check1' to extract tokens from:
    stringstream check1(line);
    string intermediate;

    // Tokenizing on space ' ':
    while (getline(check1, intermediate, ' '))
    {
        tokens.push_back(intermediate);
    }

    // Printing the token vector:
    for (size_t i = 0; i < tokens.size(); i++)
        cout << tokens[i] << '\n';

    return 0;
}
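Because the delimiter parameter of std::getline() is a single character, the same pattern handles any one-character delimiter. A minimal sketch splitting comma-separated input (the input string here is just for illustration):

// Splitting comma-separated input with the same getline pattern
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

int main() {
    string csv = "red,green,blue";
    stringstream ss(csv);
    string field;
    vector<string> fields;

    // Same loop as above, but with ',' as the delimiter:
    while (getline(ss, field, ','))
        fields.push_back(field);

    for (size_t i = 0; i < fields.size(); i++)
        cout << fields[i] << '\n';

    return 0;
}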
Using C's strtok()
strtok() splits a C-string according to a given set of delimiter characters and returns the next token. It needs to be called in a loop to get all tokens: the first call is passed the string itself, and each subsequent call is passed NULL. It returns NULL when there are no more tokens. Note that strtok() modifies the string it parses, writing '\0' over each delimiter it finds.
Prototype:
char *strtok(char *str, const char *delims);
Below is a C++ demonstration:
// C/C++ program for splitting a string
// using strtok()
#include <stdio.h>
#include <string.h>

int main()
{
    char str[] = "Geeks-for-Geeks";

    // Get the first token:
    char *token = strtok(str, "-");

    // Keep printing tokens while there are
    // more tokens left in str[]:
    while (token != NULL)
    {
        printf("%s\n", token);
        token = strtok(NULL, "-");
    }
    return 0;
}
Output:
Geeks
for
Geeks
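Since strtok() treats its second argument as a set of individual delimiter characters, one call can split on several different characters at once. A minimal sketch (the input string is chosen just for illustration):

// Splitting on both '-' and ',' with strtok()
#include <stdio.h>
#include <string.h>

int main()
{
    char str[] = "alpha-beta,gamma-delta";

    // "-," means: treat '-' and ',' as delimiters.
    char *token = strtok(str, "-,");
    while (token != NULL)
    {
        printf("%s\n", token);
        token = strtok(NULL, "-,");
    }
    return 0;
}

This prints alpha, beta, gamma and delta, each on its own line.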
Using C's strtok_r()
Just like the strtok() function in C, strtok_r() parses a string into a sequence of tokens. strtok_r() is a reentrant version of strtok(): instead of keeping hidden static state between calls, it stores its position in a caller-supplied pointer.
char *strtok_r(char *str,
               const char *delim,
               char **saveptr);
There are two ways we can call strtok_r(): on the first call, str points to the string to be parsed; on each subsequent call, str is NULL and parsing resumes where the previous call left off (both forms are shown below).
The third argument saveptr is a pointer to a char * variable that is used internally by strtok_r() in order to maintain context between successive calls that parse the same string.
Below is a simple C++ program to show the use of strtok_r():
#include <stdio.h>
#include <string.h>

int main()
{
    char str[] = "Geeks for Geeks";
    char *token;
    char *rest = str;

    // 'rest' is used both as the string argument and as the
    // save pointer: it always points at the not-yet-parsed
    // remainder of the string.
    while ((token = strtok_r(rest, " ", &rest)))
        printf("%s\n", token);

    return 0;
}
Output:
Geeks
for
Geeks
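For contrast, here is a minimal sketch of the more conventional calling pattern mentioned above, where only the first call passes the string and every subsequent call passes NULL:

#include <stdio.h>
#include <string.h>

int main()
{
    char str[] = "Geeks for Geeks";
    char *saveptr;

    // First call: pass the string itself.
    char *token = strtok_r(str, " ", &saveptr);
    while (token != NULL)
    {
        printf("%s\n", token);
        // Subsequent calls: pass NULL so parsing resumes
        // where the previous call stopped.
        token = strtok_r(NULL, " ", &saveptr);
    }
    return 0;
}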
Using std::sregex_token_iterator
In this method, tokenization is done on the basis of regex matches. It is the better choice for use cases where multiple delimiters are needed.
Below is a simple C++ program to show the use of std::sregex_token_iterator:
#include <algorithm>
#include <iostream>
#include <regex>
#include <string>
#include <vector>

/* Tokenize the given string according to the regex
   and remove the empty tokens. */
std::vector<std::string> tokenize(
    const std::string& str,
    const std::regex& re)
{
    // The -1 selects the parts of the string that did
    // NOT match the regex, i.e. the tokens themselves:
    std::sregex_token_iterator it{ str.begin(),
                                   str.end(), re, -1 };
    std::vector<std::string> tokenized{ it, {} };

    // Additional check to remove empty strings:
    tokenized.erase(
        std::remove_if(tokenized.begin(),
                       tokenized.end(),
                       [](std::string const& s) {
                           return s.size() == 0;
                       }),
        tokenized.end());

    return tokenized;
}

// Driver Code
int main()
{
    const std::string str = "Break string a,spaces,and,commas";

    // Split on any run of whitespace or commas:
    const std::regex re(R"([\s,]+)");

    // Function Call
    const std::vector<std::string> tokenized =
        tokenize(str, re);

    for (const std::string& token : tokenized)
        std::cout << token << '\n';

    return 0;
}
Lexing*
A lexer (or lexical analyzer) is a program that takes a stream of raw input characters, such as source code, and breaks it down into a sequence of meaningful units called tokens. These tokens represent keywords, operators, identifiers, numbers, and other linguistic components, which are then passed to a parser for syntactic analysis.
How it works:
- Input: The lexer receives a string of characters (e.g., int x = 1;) as input.
- Tokenization: It scans the input, identifying patterns and classifying sequences of characters into specific token categories.
- Output: The output is a stream of tokens, where each token typically includes its type (like "keyword," "identifier," or "operator") and its actual value. For example, int x = 1; would be broken into tokens: int (keyword), x (identifier), = (operator), 1 (number), and ; (punctuation).
- Discarding whitespace: A lexer also typically ignores whitespace and comments, which are not considered part of the core language structure.
Purpose in Compilers:
The lexer is the first phase in the compilation process. By converting the input characters into a more manageable and structured stream of tokens, it simplifies the task of the parser, which handles the grammatical correctness of the code.
Analogy:
Think of a lexer as the process of breaking down a sentence into individual words and punctuation marks (tokens). The parser would then take these words and understand the overall structure and meaning of the sentence.
(From AI Overview, by Google)
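As a concrete illustration of the steps above, below is a minimal hand-written lexer sketch in C++. The token categories and the tiny keyword table are simplified assumptions chosen to handle the int x = 1; example, not a production design:

// Minimal lexer sketch: turns "int x = 1;" into typed tokens.
#include <cctype>
#include <iostream>
#include <string>
#include <vector>

struct Token {
    std::string type;   // "keyword", "identifier", "number", ...
    std::string value;  // the matched characters
};

std::vector<Token> lex(const std::string& src)
{
    std::vector<Token> tokens;
    size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i;  // discard whitespace
        } else if (std::isalpha(static_cast<unsigned char>(c))) {
            std::string word;
            while (i < src.size() &&
                   std::isalnum(static_cast<unsigned char>(src[i])))
                word += src[i++];
            // A tiny keyword table; everything else is an identifier.
            bool kw = (word == "int" || word == "return");
            tokens.push_back({kw ? "keyword" : "identifier", word});
        } else if (std::isdigit(static_cast<unsigned char>(c))) {
            std::string num;
            while (i < src.size() &&
                   std::isdigit(static_cast<unsigned char>(src[i])))
                num += src[i++];
            tokens.push_back({"number", num});
        } else if (c == '=') {
            tokens.push_back({"operator", "="});
            ++i;
        } else if (c == ';') {
            tokens.push_back({"punctuation", ";"});
            ++i;
        } else {
            ++i;  // skip anything unrecognized
        }
    }
    return tokens;
}

int main()
{
    for (const Token& t : lex("int x = 1;"))
        std::cout << t.type << ": " << t.value << '\n';
    return 0;
}

Running it prints one line per token (keyword: int, identifier: x, operator: =, number: 1, punctuation: ;), matching the breakdown described in the Output step.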