Handling Text in C++: Tokenizing/Analysing/Lexing...

Tokenizing

Tokenizing a string programmatically means splitting it with respect to one or more delimiters. There are several ways to tokenize a string.

Using std::getline and stringstream

We shall be relying on std::getline(ISTREAM, STRING, DELIMITER), which reads characters from ISTREAM into STRING until DELIMITER (or end-of-stream) is reached.

A stringstream associates a string object with a stream, allowing you to read from the string as if it were a stream.

Below is a C++ implementation:

// Tokenizing a string using stringstream
#include <iostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

int main() {

  string line = "GeeksForGeeks is a must try";

  // Declare a vector of string to save tokens:
  vector <string> tokens;

  // Declare stringstream 'check1' to extract tokens from:
  stringstream check1(line);

  string intermediate;

  // Tokenize on the space delimiter ' ':
  while(getline(check1, intermediate, ' '))
  {
      tokens.push_back(intermediate);
  }

  // Printing the token vector:
  for(size_t i = 0; i < tokens.size(); i++)
    cout << tokens[i] << '\n';

  return 0;

}
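
Note that with this approach consecutive delimiters produce empty tokens: splitting "a  b" on ' ' yields an empty string between "a" and "b". If that matters for your input, check whether intermediate is empty before pushing it onto the vector.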

Using C's strtok()

strtok() splits a C-string according to the given delimiters and returns the next token. It needs to be called in a loop to get all tokens: the first call takes the string itself, and subsequent calls pass NULL to continue on the same string. It returns NULL when there are no more tokens.

Prototype:

char *strtok(char *str, const char *delims);

Below is a C++ demonstration:

// C/C++ program for splitting a string
// using strtok()
#include <stdio.h>
#include <string.h>

int main()
{
  char str[] = "Geeks-for-Geeks";

  // get first token:
  char *token = strtok(str, "-");

  // Keep printing tokens while delimiters
  // remain in str[].
  while (token != NULL)
  {
    printf("%s\n", token);
    token = strtok(NULL, "-");
  }

  return 0;
}

[...]
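
Note that strtok() modifies the string it tokenizes (it overwrites each delimiter it finds with '\0') and keeps its position in hidden static state, so it cannot tokenize two strings in an interleaved fashion and is not thread-safe. That is the problem strtok_r() solves.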

Using C's strtok_r()

Just like the strtok() function in C, strtok_r() parses a string into a sequence of tokens. strtok_r() is a reentrant version of strtok().

char *strtok_r(      char * str,
               const char * delim,
                     char ** saveptr);

There are two ways to call strtok_r(): pass the string itself on the first call and NULL on every subsequent call, or reuse one pointer variable as both the input and saveptr, as the program below does.

The third argument, saveptr, is a pointer to a char * variable that strtok_r() uses internally to maintain context between successive calls that parse the same string.

Below is a simple C++ program to show the use of strtok_r():

#include <stdio.h>
#include <string.h>

int main()
{
  char str[] = "Geeks for Geeks";
  char *token;
  char *rest = str;

  while ((token = strtok_r(rest, " ", &rest)))
    printf("%s\n", token);

  return 0;
}

[...]
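
For completeness, here is a minimal sketch of the other calling pattern (same output as above): the string is passed on the first call and NULL afterwards, with a separate saveptr carrying the position:

#include <stdio.h>
#include <string.h>

int main()
{
  char str[] = "Geeks for Geeks";
  char *saveptr;

  // First call: pass the string itself.
  char *token = strtok_r(str, " ", &saveptr);

  while (token != NULL)
  {
    printf("%s\n", token);

    // Subsequent calls: pass NULL; saveptr maintains
    // the position between calls.
    token = strtok_r(NULL, " ", &saveptr);
  }

  return 0;
}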

Using std::sregex_token_iterator

In this method the tokenization is done on the basis of regex matches. It is better suited to use cases where multiple delimiters are needed.

Below is a simple C++ program to show the use of std::sregex_token_iterator:

#include <algorithm>
#include <iostream>
#include <regex>
#include <string>
#include <vector>

/* Tokenize the given string according to the regex
   and remove the empty tokens. */

std::vector<std::string> tokenize(const std::string& str,
                                  const std::regex& re)
{
    std::sregex_token_iterator it{ str.begin(),
                             str.end(), re, -1 };
    std::vector<std::string> tokenized{ it, {} };

    // Additional check to remove empty strings
    tokenized.erase(
        std::remove_if(tokenized.begin(),
                            tokenized.end(),
                       [](std::string const& s) {
                           return s.size() == 0;
                       }),
        tokenized.end());

    return tokenized;
}

// Driver Code
int main()
{
    const std::string str =
        "Break string a,spaces,and,commas";
    const std::regex re(R"([\s,]+)");

    // Function Call
    const std::vector<std::string> tokenized =
                           tokenize(str, re);

    for (const std::string& token : tokenized)
        std::cout << token << std::endl;
    return 0;
}
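
The -1 passed to std::sregex_token_iterator selects the parts of the string that the regex does not match, i.e. the text between delimiters (passing 0 instead would iterate over the matches themselves). The empty-string cleanup is needed because a match at the very start of the input yields a leading empty token.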

Lexing*

A lexer (or lexical analyzer) is a program that takes a stream of raw input characters, such as source code, and breaks it down into a sequence of meaningful units called tokens. These tokens represent keywords, operators, identifiers, numbers, and other linguistic components, which are then passed to a parser for syntactic analysis.

How it works:

  1. Input: The lexer receives a string of characters (e.g., int x = 1;) as input.
  2. Tokenization: It scans the input, identifying patterns and classifying sequences of characters into specific token categories.
  3. Output: The output is a stream of tokens, where each token typically includes its type (like "keyword," "identifier," or "operator") and its actual value. For example, int x = 1; would be broken into tokens: int (keyword), x (identifier), = (operator), 1 (number), and ; (punctuation).
  4. Discarding Whitespace: A lexer also typically ignores whitespace and comments, which are not considered part of the core language structure. (A minimal code sketch follows this list.)
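
To make the steps concrete, here is a minimal, self-contained lexer sketch in C++. It is not from the original article: the token categories and the one-keyword set ("int") are illustrative assumptions. It walks the input once, classifies runs of characters, and discards whitespace:

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

struct Token {
    std::string type;   // "keyword", "identifier", "number", "operator", "punctuation"
    std::string value;
};

std::vector<Token> lex(const std::string& input)
{
    std::vector<Token> tokens;
    size_t i = 0;
    while (i < input.size()) {
        unsigned char c = input[i];
        if (std::isspace(c)) {
            ++i;                               // step 4: discard whitespace
        } else if (std::isalpha(c)) {
            std::string word;                  // keyword or identifier
            while (i < input.size() && std::isalnum((unsigned char)input[i]))
                word += input[i++];
            // Assumed keyword set for this example: just "int".
            tokens.push_back({ word == "int" ? "keyword" : "identifier", word });
        } else if (std::isdigit(c)) {
            std::string num;                   // integer literal
            while (i < input.size() && std::isdigit((unsigned char)input[i]))
                num += input[i++];
            tokens.push_back({ "number", num });
        } else if (c == '=') {
            tokens.push_back({ "operator", "=" });
            ++i;
        } else {                               // everything else: punctuation
            tokens.push_back({ "punctuation", std::string(1, c) });
            ++i;
        }
    }
    return tokens;
}

int main()
{
    for (const Token& t : lex("int x = 1;"))
        std::cout << t.type << ": " << t.value << '\n';
    return 0;
}

Running it on int x = 1; prints the five tokens listed in step 3, one type/value pair per line.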

Purpose in Compilers:

The lexer is the first phase in the compilation process. By converting the input characters into a more manageable and structured stream of tokens, it simplifies the task of the parser, which handles the grammatical correctness of the code.

Analogy:

Think of a lexer as the process of breaking down a sentence into individual words and punctuation marks (tokens). The parser would then take these words and understand the overall structure and meaning of the sentence.

(From AI Overview, by Google)