JavaScript Strings

...

UTF-16

JavaScript strings are encoded as a sequence of 16-bit numbers. These are called code units. A Unicode character code was initially supposed to fit within such a unit (which gives you a little over 65,000 characters). When it became clear that wasn't going to be enough, many people balked at the need to use more memory per character. To address these concerns, UTF-16, the format also used by JavaScript strings, was invented. It describes most common characters using a single 16-bit code unit but uses a pair of two such units for others.


UTF-16 is generally considered a bad idea today. It seems almost intentionally designed to invite mistakes. It's easy to write programs that pretend code units and characters are the same thing. And if your language doesn't use two-unit characters, that will appear to work just fine. But as soon as someone tries to use such a program with some less common Chinese characters, it breaks. Fortunately, with the advent of emoji, everybody has started using two-unit characters, and the burden of dealing with such problems is more fairly distributed.

Surrogate Pairs

The Unicode character set is much larger than 65,536 characters. The extra characters are stored in UTF-16 as surrogate pairs, which are pairs of 16-bit code units that together represent a single character. To avoid ambiguity, both parts of the pair must be between 0xD800 and 0xDFFF, and these code units are never used to encode single-code-unit characters. (More precisely, leading surrogates, also called high-surrogate code units, have values between 0xD800 and 0xDBFF, inclusive, while trailing surrogates, also called low-surrogate code units, have values between 0xDC00 and 0xDFFF, inclusive.) Each Unicode character, composed of one or two UTF-16 code units, is also called a Unicode code point. Each Unicode code point can be written in a string with \u{xxxxxx} where xxxxxx represents 1–6 hex digits.
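
The arithmetic behind a surrogate pair can be sketched as follows. The helper name toSurrogatePair is made up for this illustration and is not a built-in; the offsets (0x10000, 0xD800, 0xDC00) come from the UTF-16 ranges described above.

```javascript
// Hypothetical helper (not a built-in): split a code point above 0xFFFF
// into its leading and trailing surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;         // leaves a 20-bit value
  const leading = 0xD800 + (offset >> 10);    // top 10 bits
  const trailing = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [leading, trailing];
}

const [lead, trail] = toSurrogatePair(0x1F434); // U+1F434, the horse face emoji
console.log(lead.toString(16), trail.toString(16)); // → d83d dc34

// The same two code units make up the string written with \u{...}.
const horse = "\u{1F434}";
console.log(horse.length);                  // → 2
console.log(horse.charCodeAt(0) === lead);  // → true
console.log(horse.charCodeAt(1) === trail); // → true
```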

Lone Surrogates

A lone surrogate is a 16-bit code unit satisfying one of the descriptions below:

  • It is in the range 0xD800–0xDBFF, inclusive (i.e., is a leading surrogate), but it is the last code unit in the string, or the next code unit is not a trailing surrogate.
  • It is in the range 0xDC00–0xDFFF, inclusive (i.e., is a trailing surrogate), but it is the first code unit in the string, or the previous code unit is not a leading surrogate.

Lone surrogates do not represent any Unicode character. Most JavaScript built-in methods handle them correctly because they work on UTF-16 code units, but lone surrogates are often not valid values when interacting with other systems. For example, encodeURI() will throw a URIError for lone surrogates, because URI encoding uses UTF-8, which has no encoding for lone surrogates. Strings that contain no lone surrogates are called well-formed strings, and are safe to use with functions that do not deal with UTF-16 (such as encodeURI() or TextEncoder). You can check whether a string is well-formed with the isWellFormed() method, or sanitize lone surrogates with the toWellFormed() method.
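
To make the two rules above concrete, here is a minimal sketch that checks them by hand. The function hasLoneSurrogate is a hypothetical helper written for this example; in environments that provide isWellFormed(), !str.isWellFormed() should give the same answer.

```javascript
// Hypothetical helper (not a built-in): returns true if the string
// contains a leading or trailing surrogate that is not part of a pair.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // Leading surrogate: must be followed by a trailing surrogate.
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the second half of a valid pair
        continue;
      }
      return true;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Trailing surrogate that was not consumed as part of a pair.
      return true;
    }
  }
  return false;
}

console.log(hasLoneSurrogate("ab\uD83D\uDC34")); // → false (valid pair)
console.log(hasLoneSurrogate("ab\uD83D"));       // → true  (leading, at the end)
console.log(hasLoneSurrogate("\uDC34ab"));       // → true  (trailing, at the start)

// encodeURI() rejects a lone surrogate with a URIError.
let threwURIError = false;
try {
  encodeURI("\uD83D");
} catch (err) {
  threwURIError = err instanceof URIError;
}
console.log(threwURIError); // → true
```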

Accessing Characters with charCodeAt and codePointAt

JavaScript's charCodeAt method gives you a code unit, not a full character code. The codePointAt method, added later, does give a full Unicode character, so we could use that to get characters from a string. But the argument passed to codePointAt is still an index into the sequence of code units. To run over all characters in a string, we'd still need to deal with the question of whether a character takes up one or two code units.
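
A short sketch of the difference, using a two-emoji string written with code point escapes. The manual loop advances by one or two code units depending on the code point found. The last lines rely on the fact that string iteration (for...of and the spread syntax) goes code point by code point, which these notes do not otherwise cover.

```javascript
const horseShoe = "\u{1F434}\u{1F45F}"; // horse face + athletic shoe
console.log(horseShoe.length);         // → 4 (code units, not characters)
console.log(horseShoe.charCodeAt(0));  // → 55357 (half a character)
console.log(horseShoe.codePointAt(0)); // → 128052 (the full code point)

// Walk the string by hand: the index is still counted in code units,
// so step over two units whenever the code point is above 0xFFFF.
const codePoints = [];
for (let i = 0; i < horseShoe.length; ) {
  const cp = horseShoe.codePointAt(i);
  codePoints.push(cp);
  i += cp > 0xFFFF ? 2 : 1;
}
console.log(codePoints); // → [ 128052, 128095 ]

// String iteration yields whole code points, so spreading gives 2 items.
const chars = [...horseShoe];
console.log(chars.length); // → 2
```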

toString()

When you call the String(OBJ) function (which converts a value to a string) on an object, it will call the toString() method on that object to try to create a meaningful string from it.

Some of the standard prototypes define their own version of toString so they can create a string that contains more useful information than [object Object]. You can also do that yourself.

Rabbit.prototype.toString = function() {
  return `a ${this.type} rabbit`;
};

which you can use as follows:

console.log(String(killerRabbit));
// → a killer rabbit

Template Literals/Strings

Template Strings use back-ticks (``) rather than quotes ("") to define a string:

let text = `Hello World!`;

Features:

Template Strings allow both single and double unescaped quotes inside a string:

Example:

let text = `He's often called "Johnny"`;

Multiline Strings

Template Strings allow multiline strings:

let text =
`The quick
brown fox
jumps over
the lazy dog`;

Interpolation

Template Strings provide an easy way to interpolate variables into strings.

Syntax

${...}

Example. Variable Substitutions:

let firstName = "John";
let lastName = "Doe";

let text = `Welcome ${firstName}, ${lastName}!`;

Expression Substitution

Template Strings allow interpolation of expressions in strings.

Example:

let price = 10;
let VAT = 0.25;

let total = `Total: ${(price * (1 + VAT)).toFixed(2)}`;

HTML Templates

Example:

let header = "Template Strings";
let tags = ["template strings", "javascript", "es6"];

let html = `<h2>${header}</h2><ul>`;
for (const x of tags) {
  html += `<li>${x}</li>`;
}

html += `</ul>`;

Replacing Substrings with replace(STARTSTRING,ENDSTRING) and replaceAll(STARTSTRING,ENDSTRING)

The replace() method searches a string for a value or a regular expression and returns a new string with the value(s) replaced. It does not change the original string.

let text = "Visit Microsoft!";
let result = text.replace("Microsoft", "W3Schools");

Or with a regular expression:

let text = "Mr Blue has a blue house and a blue car";
let result = text.replace(/blue/g, "red");

Note: If the search value is a string, only the first occurrence is replaced. To replace all occurrences, use a regular expression with the g flag set.


The replaceAll() method searches a string for a value or a regular expression. A new string with all matches replaced is returned, while the original string is not changed. (Introduced in ECMAScript 2021.)

You can replace all occurrences of a given substring:

text = text.replaceAll("Cats","Dogs");
text = text.replaceAll("cats","dogs");

You can also replace all matches of a regular expression:

text = text.replaceAll(/Cats/g,"Dogs");
text = text.replaceAll(/cats/g,"dogs");

Note: If the parameter is a regular expression, the global flag (g) must be set, otherwise a TypeError is thrown.
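
The differences between the two methods can be seen side by side in this small sketch. The variable names are arbitrary; the sentence is the one used in the replace() example above.

```javascript
const sentence = "Mr Blue has a blue house and a blue car";

// A string search value replaces only the first occurrence.
console.log(sentence.replace("blue", "red"));
// → Mr Blue has a red house and a blue car

// A regular expression with the g flag replaces every match.
console.log(sentence.replace(/blue/g, "red"));
// → Mr Blue has a red house and a red car

// Adding the i flag also matches the capitalized "Blue".
console.log(sentence.replace(/blue/gi, "red"));
// → Mr red has a red house and a red car

// replaceAll() with a string replaces every occurrence.
console.log(sentence.replaceAll("blue", "red"));
// → Mr Blue has a red house and a red car

// replaceAll() with a non-global regular expression throws a TypeError.
let threwTypeError = false;
try {
  sentence.replaceAll(/blue/, "red");
} catch (err) {
  threwTypeError = err instanceof TypeError;
}
console.log(threwTypeError); // → true

// The original string is never modified.
console.log(sentence); // → Mr Blue has a blue house and a blue car
```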

STRING.search(SUBSTRING/REGEX) and STRING.match(SUBSTRING/REGEX)

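The search() method matches a string against a regular expression and returns the index (position) of the first match, or -1 if no match is found. A string argument is converted to a regular expression first, so characters such as . keep their special meaning. (The search() method is case sensitive unless the regular expression has the i flag.)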

An example with a string argument:

let text = "Mr. Blue has a blue house";
let position = text.search("blue");

An example with a regexp argument:

let text = "Mr. Blue has a blue house";
let position = text.search(/Blue/);
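
The return values for the examples above, plus a few edge cases, are sketched below. The match() calls at the end are an addition not described in these notes: with the g flag, match() returns an array of all matching substrings, or null when nothing matches.

```javascript
const text = "Mr. Blue has a blue house";

console.log(text.search("blue"));   // → 15 (lowercase "blue")
console.log(text.search(/Blue/));   // → 4
console.log(text.search(/blue/i));  // → 4 (the i flag ignores case)
console.log(text.search("green"));  // → -1 (no match)

// A string argument is treated as a regular expression, so "." matches
// any character rather than a literal dot.
console.log(text.search("."));      // → 0
console.log(text.search("\\."));    // → 2 (escaped: the literal dot)

// match() returns the matched text instead of a position.
console.log(text.match(/blue/gi));  // → [ 'Blue', 'blue' ]
console.log(text.match(/green/g));  // → null
```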

Escape Characters

The backslash escape character (\) turns special characters into string characters:

Code  Result  Description
\'    '       Single quote
\"    "       Double quote
\\    \       Backslash
\0            Null character (U+0000 NULL), but only if the next character is not a decimal digit; otherwise it is an octal escape sequence
\b            Backspace
\f            Form Feed
\n            New Line
\r            Carriage Return
\t            Horizontal Tabulator
\v            Vertical Tabulator

Note: The last 6 escape characters above were originally designed to control typewriters, teletypes, and fax machines. They do not make any sense in HTML.
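
A few of these escapes in use, as a quick sketch. Each escape sequence takes up two or more characters in the source code but produces a single character in the resulting string.

```javascript
// Escaping the quote character that delimits the string.
const quoted = 'It\'s called "escaping"';
console.log(quoted); // → It's called "escaping"

// A backslash has to be escaped to appear in the string itself.
const path = "C:\\temp";
console.log(path);        // → C:\temp
console.log(path.length); // → 7

// \n and \t each produce exactly one character.
const twoLines = "first\nsecond";
console.log(twoLines.split("\n").length); // → 2
console.log("a\tb".length);               // → 3

// \0 is the null character, with character code 0.
console.log("\0".charCodeAt(0)); // → 0
```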

Hexadecimal escape sequences

Any character with a character code lower than 256 (i.e. any character in the Latin-1 range, U+0000 to U+00FF, sometimes loosely called extended ASCII) can be escaped using its hex-encoded character code, prefixed with \x. (Note that this is the same range of characters that can be escaped through octal escapes.)

Hexadecimal escapes are four characters long. They require exactly two characters following \x. If the hexadecimal character code is only one character long (this is the case for all character codes smaller than 16, or 10 in hex), you’ll need to pad it with a leading 0.

For example, the copyright symbol ('©') has character code 169, which gives A9 in hex, so you could write it as \xA9.

The hexadecimal part of this escape is case-insensitive; in other words, \xa9 and \xA9 are equivalent.

You could define hexadecimal escape syntax using the following regular expression: \\x[a-fA-F0-9]{2}.

Admittedly, it’s a bit confusing that the spec refers to this kind of escape sequence as hexadecimal, since Unicode escapes use hex as well.
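
The points above can be checked with a short sketch. The regular expression is the one given in the text, with ^ and $ anchors added here so that it tests a whole string; the name hexEscape is arbitrary.

```javascript
// \xA9 is the copyright symbol, character code 169.
const copyright = "\xA9";
console.log(copyright);               // → ©
console.log(copyright.charCodeAt(0)); // → 169
console.log(copyright.length);        // → 1

// The hex digits are case-insensitive.
console.log("\xa9" === "\xA9"); // → true

// Codes below 16 need a leading zero: \x09 is the tab character.
console.log("\x09" === "\t"); // → true

// Matching the escape syntax as source text (note the doubled backslash).
const hexEscape = /^\\x[a-fA-F0-9]{2}$/;
console.log(hexEscape.test("\\xA9")); // → true
console.log(hexEscape.test("\\xA"));  // → false (only one hex digit)
```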

Unicode escape sequences

Any character with a character code lower than 65536 can be escaped using the hexadecimal value of its character code, prefixed with \u. (As mentioned before, higher character codes are represented by a pair of surrogate characters.)

Unicode escapes are six characters long. They require exactly four characters following \u. If the hexadecimal character code is only one, two or three characters long, you’ll need to pad it with leading zeroes.

The copyright symbol ('©') has character code 169, which gives A9 in hexadecimal notation, so you could write it as '\u00A9'. Similarly, '♥' could be written as '\u2665'.

The hexadecimal part of this kind of character escape is case-insensitive; in other words, \u00a9 and \u00A9 are equivalent.

You could define Unicode escape syntax using the following regular expression: \\u[a-fA-F0-9]{4}.

Note: Other than a few simple escapes, Unicode escapes are the only ones allowed by the JSON specification.
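
The same checks for Unicode escapes, as a brief sketch. As before, the regular expression from the text is anchored with ^ and $ here, and the name unicodeEscape is arbitrary. The JSON.parse call illustrates the note about JSON.

```javascript
// \u00A9 and \xA9 produce the same character.
console.log("\u00A9");            // → ©
console.log("\u00A9" === "\xA9"); // → true

// Characters above 255 need the \u form.
console.log("\u2665");               // → ♥
console.log("\u2665".charCodeAt(0)); // → 9829 (0x2665)

// The hex digits are case-insensitive.
console.log("\u00a9" === "\u00A9"); // → true

// JSON accepts the same \uXXXX syntax inside its strings.
console.log(JSON.parse('"\\u2665"') === "\u2665"); // → true

// Matching the escape syntax as source text.
const unicodeEscape = /^\\u[a-fA-F0-9]{4}$/;
console.log(unicodeEscape.test("\\u2665")); // → true
console.log(unicodeEscape.test("\\uA9"));   // → false (needs zero-padding)
```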

ECMAScript 6: Unicode code point escapes

ECMAScript 6 introduced a new kind of escape sequence in strings, namely Unicode code point escapes. It also defined String.fromCodePoint and String#codePointAt, both of which work with code points rather than UCS-2/UTF-16-like code units.

With this syntax, any character can be escaped using the hexadecimal value of its code point, prefixed with \u{ and suffixed with }. This is allowed for code points up to 0x10FFFF, which is the highest code point defined by Unicode.

Unicode code point escapes consist of at least five characters. At least one hexadecimal character can be wrapped in \u{…}. There is no upper limit on the number of hex digits in use (for example '\u{000000000061}' == 'a') but for practical purposes you won’t need more than 6, unless you perform unnecessary zero-padding.

The tetragram for centre symbol (𝌆) has code point U+1D306, so you could write it as \u{1D306}. For comparison, if you were to use simple Unicode escapes to represent this symbol, you’d have to write out the surrogate halves separately: '\uD834\uDF06'.

The hexadecimal part of this kind of character escape is case-insensitive; in other words, '\u{1d306}' and '\u{1D306}' are equivalent.

You could define Unicode code point escape syntax using the following regular expression: \\u\{([0-9a-fA-F]{1,})\}.
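
A closing sketch that ties the code point escape to the surrogate pair form. The regular expression is the one from the text with ^ and $ anchors added; the name codePointEscape is arbitrary.

```javascript
// One code point escape equals the two surrogate halves written out.
const tetragram = "\u{1D306}";
console.log(tetragram === "\uD834\uDF06"); // → true
console.log(tetragram.length);             // → 2 (still two code units)

// codePointAt and String.fromCodePoint work with the full code point.
console.log(tetragram.codePointAt(0).toString(16));       // → 1d306
console.log(String.fromCodePoint(0x1D306) === tetragram); // → true

// Zero-padding and letter case do not matter.
console.log("\u{000000000061}" === "a");  // → true
console.log("\u{1d306}" === "\u{1D306}"); // → true

// Matching the escape syntax as source text.
const codePointEscape = /^\\u\{([0-9a-fA-F]{1,})\}$/;
console.log(codePointEscape.test("\\u{1D306}")); // → true
console.log(codePointEscape.test("\\u{}"));      // → false (no hex digits)
```

Note that the regular expression only checks the shape of the escape. It would also accept a value above 0x10FFFF, which is a syntax error in an actual string literal.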