Tokenization: Breaking Text into Its Component Words
[...]
Unix Tools for Crude Tokenization and Normalization
Let's begin with an easy, if somewhat naive, version of word tokenization and normalization (and frequency computation) that can be accomplished for English solely with a single UNIX command line, inspired by Church (1994). We'll make use of some Unix commands: tr, used to systematically change particular characters in the input; sort, which sorts input lines in alphabetical order; and uniq, which collapses and counts adjacent identical lines.
For example, let's begin with the complete words of Shakespeare in one file, sh.txt. We can use tr to tokenize the words by changing every sequence of non-alphabetic characters to a newline ('A-Za-z' means alphabetic, the -c option complements to non-alphabet, and the -s option squeezes each resulting run of newlines into a single newline):
tr -sc 'A-Za-z' '\n' < sh.txt
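To finish the frequency computation described above, one natural continuation (a sketch, not necessarily the only possible pipeline) is to sort the tokens so that identical words become adjacent and then count them with uniq; a second tr step can fold uppercase to lowercase as a crude normalization:

tr -sc 'A-Za-z' '\n' < sh.txt | sort | uniq -c

tr -sc 'A-Za-z' '\n' < sh.txt | tr A-Z a-z | sort | uniq -c | sort -n -r

The first command lists each distinct token with its count; the second lowercases the tokens before counting and then sorts the result by descending frequency.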