Handling Text in C++: Tokenizing/Analysing/Lexing...

Tokenizing

Tokenizing a string programmatically means splitting it with respect to one or more delimiters. There are several ways to tokenize a string.

Using std::getline and stringstream

We shall be relying on std::getline(ISTREAM, STRING, DELIMITER), which reads characters from ISTREAM into STRING until DELIMITER (or end-of-stream) is reached.

A stringstream associates a string object with a stream, allowing you to read from the string as if it were a stream.

Below is a C++ implementation:

// Tokenizing a string using stringstream
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

int main() {

  string line = "GeeksForGeeks is a must try";

  // Declare a vector of string to save tokens:
  vector <string> tokens;

  // Declare stringstream 'check1' to extract tokens from:
  stringstream check1(line);

  string intermediate;

  // Tokenize on the space delimiter ' ':
  while(getline(check1, intermediate, ' '))
  {
      tokens.push_back(intermediate);
  }

  // Printing the token vector:
  for(size_t i = 0; i < tokens.size(); i++)
    cout << tokens[i] << '\n';

  return 0;

}
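
Note that with this approach consecutive delimiters produce empty tokens: splitting "a  b" on ' ' yields an empty string between "a" and "b". If that matters for your input, check whether intermediate is empty before pushing it onto the vector.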

Using C's strtok()

strtok() splits a C-string according to the given delimiters and returns the next token. It needs to be called in a loop to get all tokens: the first call takes the string itself, and subsequent calls pass NULL to continue on the same string. It returns NULL when there are no more tokens.

Prototype:

char *strtok(char *str, const char *delims);

Below is a C++ demonstration:

// C/C++ program for splitting a string
// using strtok()
#include <stdio.h>
#include <string.h>

int main()
{
  char str[] = "Geeks-for-Geeks";

  // get first token:
  char *token = strtok(str, "-");

  // Keep printing tokens while delimiters
  // remain in str[].
  while (token != NULL)
  {
    printf("%s\n", token);
    token = strtok(NULL, "-");
  }

  return 0;
}

[...]
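
Note that strtok() modifies the string it tokenizes (it overwrites each delimiter it finds with '\0') and keeps its position in hidden static state, so it cannot tokenize two strings in an interleaved fashion and is not thread-safe. That is the problem strtok_r() solves.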

Using C's strtok_r()

Just like the strtok() function in C, strtok_r() parses a string into a sequence of tokens. strtok_r() is a reentrant version of strtok().

char *strtok_r(      char * str,
               const char * delim,
                     char ** saveptr);

There are two ways to call strtok_r(): pass the string itself on the first call and NULL on every subsequent call, or reuse one pointer variable as both the input and saveptr, as the program below does.

The third argument, saveptr, is a pointer to a char * variable that strtok_r() uses internally to maintain context between successive calls that parse the same string.

Below is a simple C++ program to show the use of strtok_r():

#include <stdio.h>
#include <string.h>

int main()
{
  char str[] = "Geeks for Geeks";
  char *token;
  char *rest = str;

  while ((token = strtok_r(rest, " ", &rest)))
    printf("%s\n", token);

  return 0;
}

[...]
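
For completeness, here is a minimal sketch of the other calling pattern (same output as above): the string is passed on the first call and NULL afterwards, with a separate saveptr carrying the position:

#include <stdio.h>
#include <string.h>

int main()
{
  char str[] = "Geeks for Geeks";
  char *saveptr;

  // First call: pass the string itself.
  char *token = strtok_r(str, " ", &saveptr);

  while (token != NULL)
  {
    printf("%s\n", token);

    // Subsequent calls: pass NULL; saveptr maintains
    // the position between calls.
    token = strtok_r(NULL, " ", &saveptr);
  }

  return 0;
}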

Using std::sregex_token_iterator

In this method the tokenization is done on the basis of regex matches. It is better suited to use cases where multiple delimiters are needed.

Below is a simple C++ program to show the use of std::sregex_token_iterator:

#include <algorithm>
#include <iostream>
#include <regex>
#include <string>
#include <vector>

/* Tokenize the given string according to the regex
   and remove the empty tokens. */

std::vector<std::string> tokenize(const std::string& str,
                                  const std::regex& re)
{
    std::sregex_token_iterator it{ str.begin(),
                             str.end(), re, -1 };
    std::vector<std::string> tokenized{ it, {} };

    // Additional check to remove empty strings
    tokenized.erase(
        std::remove_if(tokenized.begin(),
                            tokenized.end(),
                       [](std::string const& s) {
                           return s.size() == 0;
                       }),
        tokenized.end());

    return tokenized;
}

// Driver Code
int main()
{
    const std::string str =
        "Break string a,spaces,and,commas";
    const std::regex re(R"([\s,]+)");

    // Function Call
    const std::vector<std::string> tokenized =
                           tokenize(str, re);

    for (const std::string& token : tokenized)
        std::cout << token << std::endl;
    return 0;
}
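
The -1 passed to std::sregex_token_iterator selects the parts of the string that the regex does not match, i.e. the text between delimiters (passing 0 instead would iterate over the matches themselves). The empty-string cleanup is needed because a match at the very start of the input yields a leading empty token.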

Lexing*

A lexer (or lexical analyzer) is a program that takes a stream of raw input characters, such as source code, and breaks it down into a sequence of meaningful units called tokens. These tokens represent keywords, operators, identifiers, numbers, and other linguistic components, which are then passed to a parser for syntactic analysis.

How it works:

  1. Input: The lexer receives a string of characters (e.g., int x = 1;) as input.
  2. Tokenization: It scans the input, identifying patterns and classifying sequences of characters into specific token categories.
  3. Output: The output is a stream of tokens, where each token typically includes its type (like "keyword," "identifier," or "operator") and its actual value. For example, int x = 1; would be broken into tokens: int (keyword), x (identifier), = (operator), 1 (number), and ; (punctuation).
  4. Discarding Whitespace: A lexer also typically ignores whitespace and comments, which are not considered part of the core language structure. (A minimal code sketch follows this list.)
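
To make the steps concrete, here is a minimal, self-contained lexer sketch in C++. It is not from the original article: the token categories and the one-keyword set ("int") are illustrative assumptions. It walks the input once, classifies runs of characters, and discards whitespace:

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

struct Token {
    std::string type;   // "keyword", "identifier", "number", "operator", "punctuation"
    std::string value;
};

std::vector<Token> lex(const std::string& input)
{
    std::vector<Token> tokens;
    size_t i = 0;
    while (i < input.size()) {
        unsigned char c = input[i];
        if (std::isspace(c)) {
            ++i;                               // step 4: discard whitespace
        } else if (std::isalpha(c)) {
            std::string word;                  // keyword or identifier
            while (i < input.size() && std::isalnum((unsigned char)input[i]))
                word += input[i++];
            // Assumed keyword set for this example: just "int".
            tokens.push_back({ word == "int" ? "keyword" : "identifier", word });
        } else if (std::isdigit(c)) {
            std::string num;                   // integer literal
            while (i < input.size() && std::isdigit((unsigned char)input[i]))
                num += input[i++];
            tokens.push_back({ "number", num });
        } else if (c == '=') {
            tokens.push_back({ "operator", "=" });
            ++i;
        } else {                               // everything else: punctuation
            tokens.push_back({ "punctuation", std::string(1, c) });
            ++i;
        }
    }
    return tokens;
}

int main()
{
    for (const Token& t : lex("int x = 1;"))
        std::cout << t.type << ": " << t.value << '\n';
    return 0;
}

Running it on int x = 1; prints the five tokens listed in step 3, one type/value pair per line.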

Purpose in Compilers:

The lexer is the first phase in the compilation process. By converting the input characters into a more manageable and structured stream of tokens, it simplifies the task of the parser, which handles the grammatical correctness of the code.

Analogy:

Think of a lexer as the process of breaking down a sentence into individual words and punctuation marks (tokens). The parser would then take these words and understand the overall structure and meaning of the sentence.

(From AI Overview, by Google)