Regular Expressions: A Language of Text Patterns

Regular expressions are a miniature language for describing text patterns, which makes them useful for both matching and transforming text. They are a powerful tool that requires only a small time investment to learn, and they are supported by most modern programming languages, text editors, and text processors.

A regular expression pattern is a sequence of characters representing what you want to match in a given string. Most characters in a regular expression match themselves; the exceptions are special characters (meta-characters) such as . * + ? [ ] ( ) { } ^ $ | and \, which have special meanings.

The Uses of Regular Expressions

Regular expressions can be used for several string-related operations:

Validation: Check if an input string is well-formed.
For example: Is the input string a well-formed phone number?

Decision: Check what kind of string an input represents.
For example: Is the input string the name of a JPEG or a PNG file?

Parsing: Extract information from an input string.
For example: From a full filename, extract the filename part without the full path and without its extension.

Transformation: Search sub-strings and replace them with a new formatted sub-string.
For example: Search all occurrences of "C++14" and replace them with "C++".

Iteration: Search all occurrences of a sub-string.
For example: Extract all phone numbers from an input string.

Tokenization: Split a string into sub-strings based on a set of delimiters.
For example: Split a string on whitespace, commas, periods, and so on to extract its individual words.
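
The six uses above can be sketched with Python's re module. This is only an illustration under stated assumptions: the phone number format (three digits, three digits, four digits, separated by hyphens), the helper names image_kind and stem, and all sample strings are invented for the example and are not part of any standard.

```python
import re

# Assumed phone format for this sketch: 555-123-4567
PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")

# Validation: the whole input must be a phone number.
is_phone = PHONE.fullmatch("555-123-4567") is not None

# Decision: classify a file name by its extension (hypothetical helper).
def image_kind(name):
    if re.search(r"\.jpe?g$", name, re.IGNORECASE):
        return "jpeg"
    if re.search(r"\.png$", name, re.IGNORECASE):
        return "png"
    return None

# Parsing: file name without directory and without its last extension
# (hypothetical helper; assumes "/" as the path separator).
def stem(path):
    m = re.search(r"([^/]+?)(?:\.[^./]*)?$", path)
    return m.group(1) if m else None

# Transformation: replace every "C++14" with "C++" (the + signs are escaped).
updated = re.sub(r"C\+\+14", "C++", "C++14 lambdas and C++14 constexpr")

# Iteration: collect every phone number in a longer text.
phones = PHONE.findall("Call 555-123-4567 or 555-987-6543.")

# Tokenization: split on runs of whitespace, commas, and periods,
# dropping the empty string a trailing delimiter would leave behind.
words = [w for w in re.split(r"[\s,.]+", "Hello, world. Bye.") if w]

print(is_phone, image_kind("photo.JPG"), stem("/home/user/photo.jpg"))
print(updated, phones, words)
```

Precompiling the phone pattern once lets the same definition serve both validation (fullmatch) and iteration (findall), which keeps the two uses consistent.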

Regular Expression Components and Terminology

Before going into more detail on regular expressions, there is some important terminology to know. The following terms are used throughout the discussion:

Pattern: The actual regular expression is a pattern represented by a string.
Match: Determines whether there is a match between a given regular expression and all of the characters in a given sequence [first,last).
Search: Determines whether there is some sub-string within a given sequence [first,last) that matches a given regular expression.
Replace: Identifies sub-strings in a given sequence, and replaces them with a corresponding new sub-string computed from another pattern, called a substitution pattern.
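
The difference between these operations is easiest to see on one input. The sketch below uses Python's re module as a stand-in, on the assumption that fullmatch corresponds to "match", search to "search", and sub to "replace"; the digit pattern and the sample strings are illustrative choices.

```python
import re

pattern = re.compile(r"\d+")  # the pattern: one or more digits

# Match: the pattern must account for every character in the sequence.
match_all_digits = pattern.fullmatch("2024") is not None
match_mixed = pattern.fullmatch("year 2024") is not None

# Search: it is enough that some sub-string matches.
found = pattern.search("year 2024")
found_text = found.group() if found else None

# Replace: the substitution pattern \g<0> stands for the matched text,
# so each number is wrapped in angle brackets.
replaced = pattern.sub(r"<\g<0>>", "from 1999 to 2024")

print(match_all_digits, match_mixed, found_text, replaced)
```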

Character Classes and Bracket Expressions

A bracket expression is a list of characters enclosed by [ and ]. It matches any single character in that list. If the first character of the list is the caret ^ then it matches any character not in the list; it is unspecified whether it matches an encoding error. For example, the regular expression [0123456789] matches any single digit.

Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, [a-d] is equivalent to [abcd].

Many locales sort characters in dictionary order, and in these locales [a-d] is typically not equivalent to [abcd]; it might be equivalent to [aBbCcDd], for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value C.

Finally, certain named classes of characters are predefined within bracket expressions. Their names are self-explanatory: [:alnum:], [:alpha:], [:blank:], [:cntrl:], [:digit:], [:graph:], [:lower:], [:print:], [:punct:], [:space:], [:upper:], and [:xdigit:]. For example, [[:alnum:]] means the character class of numbers and letters in the current locale. In the C locale and ASCII character set encoding, this is the same as [0-9A-Za-z]. (Note that the brackets in these class names are part of the symbolic names, and must be included in addition to the brackets delimiting the bracket expression.)

Most meta-characters lose their special meaning inside bracket expressions. To include a literal ] place it first in the list. Similarly, to include a literal ^ place it anywhere but first. Finally, to include a literal - place it last.
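
A short sketch of bracket expressions, again using Python's re module. One caveat: Python's re does not support the POSIX named classes such as [[:alnum:]], so the sketch assumes the C locale equivalent [0-9A-Za-z] instead. Python also does not apply locale collation to ranges, so [a-d] always behaves like [abcd] here.

```python
import re

def matches(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# A list of characters matches any single character in the list.
digit_ok = matches(r"[0123456789]", "7")

# A range matches any single character between its endpoints, inclusive.
range_ok = matches(r"[a-d]", "c")
range_miss = matches(r"[a-d]", "e")

# A leading caret negates the list.
negated_ok = matches(r"[^a-d]", "e")

# Stand-in for [[:alnum:]] in the C locale (assumption: ASCII only).
alnum_runs = re.findall(r"[0-9A-Za-z]+", "abc-123_x")

# Literal ] first, literal ^ anywhere but first, literal - last.
literals = re.findall(r"[]^-]", "a]b^c-d")

print(digit_ok, range_ok, range_miss, negated_ok, alnum_runs, literals)
```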

Anchoring

The caret ^ and the dollar sign $ are meta-characters that respectively match the empty string at the beginning and end of a line.
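
As an illustration, the sketch below anchors patterns to line boundaries with Python's re module. Note the assumption: in Python, ^ and $ match at every line only when the re.MULTILINE flag is given; without it, ^ matches only at the start of the whole string. The sample text is invented.

```python
import re

text = "alpha one\nbeta two\ngamma three"

# With MULTILINE, ^ and $ match at the beginning and end of each line.
first_words = re.findall(r"^\w+", text, re.MULTILINE)
last_words = re.findall(r"\w+$", text, re.MULTILINE)

# Without MULTILINE, ^ matches only at the very start of the string.
first_only = re.findall(r"^\w+", text)

# Anchoring both ends selects whole lines that start with "beta".
beta_lines = re.findall(r"^beta.*$", text, re.MULTILINE)

print(first_words, last_words, first_only, beta_lines)
```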

The Backslash Character and Special Expressions

The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word. The symbol \w is a synonym for [_[:alnum:]] and \W is a synonym for [^_[:alnum:]].
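
The following sketch shows these symbols in Python's re module. Python supports \b, \B, \w, and \W but not \< and \>, so the last two lines emulate them with a boundary plus a lookaround; that emulation is an assumption of this example, not a feature of the GNU syntax. The re.ASCII flag restricts \w to [_0-9A-Za-z], which corresponds to [_[:alnum:]] in the C locale.

```python
import re

text = "cat concat cats cat."

# \b on both sides: "cat" only as a whole word.
whole_words = re.findall(r"\bcat\b", text)

# \B on the left: "cat" only where it does not start a word.
inside_starts = [m.start() for m in re.finditer(r"\Bcat\b", text)]

# \w matches word characters, \W everything else.
word_runs = re.findall(r"\w+", "snake_case, x2!", re.ASCII)
other_runs = re.findall(r"\W+", "snake_case, x2!", re.ASCII)

# Emulating \< (beginning of word) and \> (end of word) with lookarounds.
word_begins = [m.start() for m in re.finditer(r"\b(?=\w)", "hi there")]
word_ends = [m.start() for m in re.finditer(r"\b(?<=\w)", "hi there")]

print(whole_words, inside_starts, word_runs, other_runs)
print(word_begins, word_ends)
```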

Repetition

A regular expression may be followed by one of several repetition operators:

`?`	The preceding item is optional and matched at most once.
`*`	The preceding item will be matched zero or more times.
`+`	The preceding item will be matched one or more times.
`{n}`	The preceding item is matched exactly n times.
`{n,}`	The preceding item is matched n or more times.
`{,m}`	The preceding item is matched at most m times. This is a GNU extension.
`{n,m}`	The preceding item is matched at least n times, but not more than m times.
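
Each operator in the table can be checked with a small test. The sketch below uses Python's re module, which also happens to accept the {,m} form; the helper name full and the sample patterns are chosen for illustration.

```python
import re

def full(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

results = {
    "? optional":       (full(r"colou?r", "color"), full(r"colou?r", "colouur")),
    "* zero or more":   (full(r"ab*", "a"), full(r"ab*", "abbb")),
    "+ one or more":    (full(r"ab+", "a"), full(r"ab+", "abbb")),
    "{n} exactly n":    (full(r"\d{4}", "2024"), full(r"\d{4}", "202")),
    "{n,} n or more":   (full(r"\d{2,}", "1"), full(r"\d{2,}", "12345")),
    "{,m} at most m":   (full(r"\d{,3}", ""), full(r"\d{,3}", "1234")),
    "{n,m} n up to m":  (full(r"\d{2,4}", "123"), full(r"\d{2,4}", "12345")),
}

for name, outcome in results.items():
    print(name, outcome)
```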

Greedy versus Lazy Quantifiers

The distinction between greedy and lazy quantifiers concerns how much of the input string is consumed during a match. By default, quantifiers are greedy: they match as many occurrences of the preceding element as possible while still allowing the overall pattern to match. For instance, when the pattern a.*b is applied to the string aabab, the greedy .* extends to the last b, so the match is the entire string aabab, which may be more than intended. Lazy quantifiers instead match as few characters as necessary to satisfy the pattern. In flavors that support them, such as Perl-compatible and ECMAScript syntax, a quantifier is made lazy by appending a question mark to it. For example, a.*?b stops at the first b it can reach and matches only aab.

Understanding the difference between greedy and lazy quantifiers is crucial, as it directly impacts the behavior of the regex when multiple potential matches are available.
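
The aabab example can be reproduced directly. This sketch uses Python's re module, which supports lazy quantifiers; the tag-matching strings are an additional illustration invented for this example.

```python
import re

# Greedy: .* runs to the last b, so the whole string is matched.
greedy = re.search(r"a.*b", "aabab").group()

# Lazy: .*? stops at the first b that lets the pattern succeed.
lazy = re.search(r"a.*?b", "aabab").group()

# A typical case where greediness matters: matching delimited pieces.
greedy_tags = re.findall(r"<.*>", "<b>bold</b>")
lazy_tags = re.findall(r"<.*?>", "<b>bold</b>")

print(greedy, lazy, greedy_tags, lazy_tags)
```

Both forms still start at the leftmost possible position; laziness changes only where the match ends.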

Capture Groups

A capture group makes it possible to analyze parts of a match separately. Capture groups are defined by a pair of parentheses ( ) and are numbered from 1 in the order of their opening parentheses. The regular expression ((a+)(b+)(c+)) has four capture groups: ((a+)(b+)(c+)), (a+), (b+), and (c+). The overall match is the 0-th capture group.
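
A minimal sketch of the four groups, using Python's re module and an input string chosen for illustration:

```python
import re

m = re.fullmatch(r"((a+)(b+)(c+))", "aaabbc")

whole = m.group(0)    # the overall match (0-th group)
outer = m.group(1)    # ((a+)(b+)(c+))
a_part = m.group(2)   # (a+)
b_part = m.group(3)   # (b+)
c_part = m.group(4)   # (c+)

# groups() returns groups 1..n as a tuple, in opening-parenthesis order.
all_groups = m.groups()

print(whole, outer, a_part, b_part, c_part)
print(all_groups)
```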

Some Common/Useful Regular Expressions

Note: You will often want to append the i switch (ignore case) to these patterns.

Words

Either (no hyphens)

/\b\w+\b/

or (hyphens allowed)

/\b[\w-]+\b/
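
To try the two word patterns, here is a sketch in Python. It assumes the no-hyphen pattern is meant to match whole words, so it is written as \b\w+\b. The surrounding slashes are delimiters in Perl-style notation and are dropped in Python. The sample sentence is invented.

```python
import re

text = "a well-known state-of-the-art idea"

# No hyphens: each hyphenated word is split into its parts.
plain_words = re.findall(r"\b\w+\b", text)

# Hyphens allowed: hyphenated words stay in one piece.
hyphen_words = re.findall(r"\b[\w-]+\b", text)

print(plain_words)
print(hyphen_words)
```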

E-mail Addresses

/[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/
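
The e-mail pattern can be exercised as follows. This is a sketch in Python with invented addresses; the pattern is a pragmatic approximation and does not implement the full address grammar, so treat a match as "looks like an address" rather than proof of validity. The re.IGNORECASE flag plays the role of the i switch.

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", re.IGNORECASE)

text = "Contact alice@example.com or bob.smith+tag@mail.example.org."

# Iteration: pull every address-like sub-string out of the text.
found = EMAIL.findall(text)

# Validation: require the whole input to match.
valid = EMAIL.fullmatch("alice@example.com") is not None
no_at_sign = EMAIL.fullmatch("not-an-email") is not None
no_dot_in_domain = EMAIL.fullmatch("user@localhost") is not None

print(found, valid, no_at_sign, no_dot_in_domain)
```

Note that the sentence-ending period after the second address is not included in the match, because the final part of the pattern accepts letters only.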