Handling Text in C++: Tokenizing/Analysing/Lexing...
Tokenizing
Tokenizing a string programmatically means splitting a string with respect to some delimiter(s). There are several ways to tokenize a string.
Using std::getline and stringstream
We shall be relying on std::getline(ISTREAM, STRING, DELIMITER), which reads characters from the stream into the string until the delimiter is found.
A stringstream associates a string object with a stream, allowing you to read from the string as if it were a stream.
Below is a C++ implementation:
// Tokenizing a string using stringstream
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

int main()
{
    string line = "GeeksForGeeks is a must try";

    // Declare a vector of string to save tokens
    vector<string> tokens;

    // Declare stringstream 'check1' to extract tokens from
    stringstream check1(line);

    string intermediate;

    // Tokenizing w.r.t. space ' '
    while (getline(check1, intermediate, ' '))
    {
        tokens.push_back(intermediate);
    }

    // Printing the token vector
    for (size_t i = 0; i < tokens.size(); i++)
        cout << tokens[i] << '\n';

    return 0;
}
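Note that getline() produces an empty token for every pair of consecutive delimiters. When the delimiter is whitespace, extraction with operator>> is a common alternative that skips runs of whitespace instead; a minimal sketch:

// Whitespace splitting via operator>>, which skips
// runs of spaces, tabs and newlines between tokens
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    std::string line = "GeeksForGeeks   is a   must try";
    std::istringstream iss(line);
    std::vector<std::string> tokens;

    std::string word;
    while (iss >> word) // skips any amount of whitespace
        tokens.push_back(word);

    for (const std::string& t : tokens)
        std::cout << t << '\n';

    return 0;
}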
Using C's strtok()
strtok() splits a C-string according to given delimiters and returns the next token. It needs to be called in a loop to get all tokens. It returns NULL when there are no more tokens.
Prototype:
char * strtok(char str[], const char *delims);
Below is a C++ demonstration:
// C/C++ program for splitting a string
// using strtok()
#include <stdio.h>
#include <string.h>

int main()
{
    char str[] = "Geeks-for-Geeks";

    // Get the first token
    char *token = strtok(str, "-");

    // Keep printing tokens while one of the
    // delimiters is present in str[]
    while (token != NULL)
    {
        printf("%s\n", token);
        token = strtok(NULL, "-");
    }

    return 0;
}
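One caveat worth noting: strtok() writes '\0' terminators into the buffer it tokenizes, so it must be given a modifiable array (never a string literal), and it keeps hidden internal state, which makes it unsafe for tokenizing two strings concurrently. That is the problem the reentrant variant below addresses.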
[...]
Using C's strtok_r()
Just like the strtok() function in C, strtok_r() parses a string into a sequence of tokens. strtok_r() is a reentrant version of strtok().
char *strtok_r( char * str, const char * delim, char ** saveptr);
There are two ways we can call strtok_r(): the first call on a given string passes the string itself, and subsequent calls on the same string pass NULL.
The third argument, saveptr, is a pointer to a char * variable that is used internally by strtok_r() in order to maintain context between successive calls that parse the same string.
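Here is a minimal sketch of that classic calling convention (the variable names are illustrative):

#include <stdio.h>
#include <string.h>

int main()
{
    char str[] = "Geeks for Geeks";
    char *saveptr;

    // First call: pass the string to tokenize
    char *token = strtok_r(str, " ", &saveptr);

    // Subsequent calls: pass NULL; saveptr carries the state
    while (token != NULL)
    {
        printf("%s\n", token);
        token = strtok_r(NULL, " ", &saveptr);
    }

    return 0;
}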
And below is a more compact idiom that reuses a single pointer as both the string to parse and the save pointer:
#include <stdio.h>
#include <string.h>

int main()
{
    char str[] = "Geeks for Geeks";
    char *token;
    char *rest = str;

    // 'rest' serves both as the string argument and as the
    // save pointer that carries state between calls
    while ((token = strtok_r(rest, " ", &rest)))
        printf("%s\n", token);

    return 0;
}
[...]
Using std::sregex_token_iterator
In this method, tokenization is done on the basis of regex matches. It is better suited to use cases where multiple delimiters are needed.
Below is a simple C++ program to show the use of std::sregex_token_iterator:
#include <algorithm>
#include <iostream>
#include <regex>
#include <string>
#include <vector>

/* Tokenize the given string according to the regex
and remove the empty tokens. */
std::vector<std::string> tokenize(
    const std::string& str, const std::regex& re)
{
    // -1 selects the parts of the string between matches
    std::sregex_token_iterator it{
        str.begin(), str.end(), re, -1
    };
    std::vector<std::string> tokenized{ it, {} };

    // Additional check to remove empty strings
    tokenized.erase(
        std::remove_if(tokenized.begin(), tokenized.end(),
                       [](std::string const& s) {
                           return s.empty();
                       }),
        tokenized.end());

    return tokenized;
}

// Driver Code
int main()
{
    const std::string str = "Break string a,spaces,and,commas";

    // Split on runs of whitespace or commas (inside a character
    // class '|' is a literal pipe, so it is omitted here)
    const std::regex re(R"([\s,]+)");

    // Function Call
    const std::vector<std::string> tokenized
        = tokenize(str, re);

    for (const std::string& token : tokenized)
        std::cout << token << '\n';

    return 0;
}
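The -1 passed to std::sregex_token_iterator selects the character sequences between matches of the regex, i.e. the tokens themselves; passing 0 instead would iterate over the matched delimiter runs.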
Lexing*
A lexer (or lexical analyzer) is a program that takes a stream of raw input characters, such as source code, and breaks it down into a sequence of meaningful units called tokens. These tokens represent keywords, operators, identifiers, numbers, and other linguistic components, which are then passed to a parser for syntactic analysis.
How it works:
- Input: The lexer receives a string of characters (e.g., int x = 1;) as input.
- Tokenization: It scans the input, identifying patterns and classifying sequences of characters into specific token categories.
- Output: The output is a stream of tokens, where each token typically includes its type (like "keyword," "identifier," or "operator") and its actual value. For example, int x = 1; would be broken into tokens: int (keyword), x (identifier), = (operator), 1 (number), and ; (punctuation); a minimal sketch of such a lexer follows this list.
- Discarding Whitespace: A lexer also typically ignores whitespace and comments, which are not considered part of the core language structure.
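To make the pipeline concrete, here is a minimal, illustrative lexer for the int x = 1; example above. It is a sketch, not a production design: the token categories and the single keyword "int" are this example's own assumptions.

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Token categories assumed for this sketch
enum class TokenType { Keyword, Identifier, Operator, Number, Punctuation };

struct Token {
    TokenType type;
    std::string value;
};

const char* name(TokenType t)
{
    switch (t) {
    case TokenType::Keyword:    return "keyword";
    case TokenType::Identifier: return "identifier";
    case TokenType::Operator:   return "operator";
    case TokenType::Number:     return "number";
    default:                    return "punctuation";
    }
}

// Scan the input once, classifying each maximal run of
// characters and discarding whitespace
std::vector<Token> lex(const std::string& src)
{
    std::vector<Token> tokens;
    size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        if (std::isspace(static_cast<unsigned char>(c))) {
            ++i; // discard whitespace
        } else if (std::isalpha(static_cast<unsigned char>(c))) {
            std::string word;
            while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i])))
                word += src[i++];
            // Only 'int' is treated as a keyword in this sketch
            TokenType t = (word == "int") ? TokenType::Keyword
                                          : TokenType::Identifier;
            tokens.push_back({ t, word });
        } else if (std::isdigit(static_cast<unsigned char>(c))) {
            std::string num;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i])))
                num += src[i++];
            tokens.push_back({ TokenType::Number, num });
        } else if (c == '=') {
            tokens.push_back({ TokenType::Operator, "=" });
            ++i;
        } else {
            // Everything else (';' here) is punctuation
            tokens.push_back({ TokenType::Punctuation, std::string(1, c) });
            ++i;
        }
    }
    return tokens;
}

int main()
{
    for (const Token& t : lex("int x = 1;"))
        std::cout << name(t.type) << " : " << t.value << '\n';
    return 0;
}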
Purpose in Compilers:
The lexer is the first phase in the compilation process. By converting the input characters into a more manageable and structured stream of tokens, it simplifies the task of the parser, which handles the grammatical correctness of the code.
Analogy:
Think of a lexer as the process of breaking down a sentence into individual words and punctuation marks (tokens). The parser would then take these words and understand the overall structure and meaning of the sentence.
(From AI Overview, by Google)