JavaScript Strings

...

UTF-16

JavaScript strings are encoded as a sequence of 16-bit numbers. These are called code units. A Unicode character code was initially supposed to fit within such a unit (which gives you a little over 65,000 characters). When it became clear that wasn't going to be enough, many people balked at the need to use more memory per character. To address these concerns, UTF-16, the format also used by JavaScript strings, was invented. It describes most common characters using a single 16-bit code unit but uses a pair of two such units for others.
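One way to see the difference between code units and characters is to compare a string's length property, which counts 16-bit code units, with the number of items produced by iterating the string, which yields whole characters. The sketch below uses arbitrary sample strings; the variable names are illustrative only.

```javascript
// length counts 16-bit code units, not characters.
const plain = "abc"; // three characters, one code unit each
const emoji = "😀";  // one character (U+1F600), two code units

console.log(plain.length); // → 3
console.log(emoji.length); // → 2

// Spreading a string iterates by character, so the emoji counts once.
const plainCount = [...plain].length;
const emojiCount = [...emoji].length;
console.log(plainCount); // → 3
console.log(emojiCount); // → 1
```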

UTF-16 is generally considered a bad idea today. It seems almost intentionally designed to invite mistakes. It's easy to write programs that pretend code units and characters are the same thing. And if your language doesn't use two-unit characters, that will appear to work just fine. But as soon as someone tries to use such a program with some less common Chinese characters, it breaks. Fortunately, with the advent of emoji, everybody has started using two-unit characters, and the burden of dealing with such problems is more fairly distributed.

Surrogate Pairs

As noted above, the entire Unicode character set contains far more than 65,536 characters. The extra characters are stored in UTF-16 as surrogate pairs, which are pairs of 16-bit code units that represent a single character. To avoid ambiguity, the two parts of the pair must be between 0xD800 and 0xDFFF, and these code units are not used to encode single-code-unit characters. (More precisely, leading surrogates, also called high-surrogate code units, have values between 0xD800 and 0xDBFF, inclusive, while trailing surrogates, also called low-surrogate code units, have values between 0xDC00 and 0xDFFF, inclusive.) Each Unicode character, made up of one or two UTF-16 code units, is also called a Unicode code point. Each Unicode code point can be written in a string with \u{xxxxxx} where xxxxxx represents 1–6 hex digits.
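The arithmetic behind a surrogate pair can be sketched as follows. The helper name toSurrogatePair is hypothetical (it is not a built-in); it assumes the standard UTF-16 rule of subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves.

```javascript
// Split a code point above 0xFFFF into its two UTF-16 code units.
// toSurrogatePair is a hypothetical helper, not part of the language.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point does not need a surrogate pair");
  }
  const offset = codePoint - 0x10000;      // 20 bits remain
  const lead = 0xd800 + (offset >> 10);    // top 10 bits
  const trail = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [lead, trail];
}

const [lead, trail] = toSurrogatePair(0x1f600); // 😀
console.log(lead.toString(16), trail.toString(16)); // → d83d de00

// The built-in accessors agree with the arithmetic.
const smiley = "\u{1F600}";
console.log(smiley.charCodeAt(0) === lead);               // → true
console.log(smiley.charCodeAt(1) === trail);              // → true
console.log(String.fromCharCode(lead, trail) === smiley); // → true
```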

Lone Surrogates

A lone surrogate is a 16-bit code unit satisfying one of the descriptions below:

It is in the range 0xD800–0xDBFF, inclusive (i.e., is a leading surrogate), but it is the last code unit in the string, or the next code unit is not a trailing surrogate.
It is in the range 0xDC00–0xDFFF, inclusive (i.e., is a trailing surrogate), but it is the first code unit in the string, or the previous code unit is not a leading surrogate.

Lone surrogates do not represent any Unicode character. Most JavaScript built-in methods handle them correctly because they work on UTF-16 code units, but lone surrogates are often not valid values when interacting with other systems. For example, encodeURI() will throw a URIError for lone surrogates, because URI encoding uses UTF-8, which has no encoding for lone surrogates. Strings that contain no lone surrogates are called well-formed strings and are safe to use with functions that do not deal with UTF-16 (such as encodeURI() or TextEncoder). You can check whether a string is well-formed with the isWellFormed() method, or sanitize lone surrogates with the toWellFormed() method.
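The two rules for lone surrogates can be checked by hand. The sketch below uses a hypothetical helper, hasLoneSurrogate, written only for illustration; where the engine provides isWellFormed() and toWellFormed(), those built-ins are the better choice, so the example calls them only after a feature check.

```javascript
// hasLoneSurrogate is a hypothetical helper that applies the two rules above.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // Leading surrogate: the next unit must be a trailing surrogate.
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the second half of a complete pair
        continue;
      }
      return true;
    }
    // A trailing surrogate reached here was not preceded by a leading one.
    if (unit >= 0xdc00 && unit <= 0xdfff) return true;
  }
  return false;
}

const wellFormed = "ab\uD83D\uDE00"; // complete pair
const loneLead = "ab\uD83D";         // leading surrogate at the end
const loneTrail = "\uDE00ab";        // trailing surrogate at the start

console.log(hasLoneSurrogate(wellFormed)); // → false
console.log(hasLoneSurrogate(loneLead));   // → true
console.log(hasLoneSurrogate(loneTrail));  // → true

// encodeURI() rejects lone surrogates.
let threwUriError = false;
try {
  encodeURI(loneLead);
} catch (err) {
  threwUriError = err instanceof URIError;
}
console.log(threwUriError); // → true

// Newer engines ship the check and the repair as built-ins.
if (typeof "".isWellFormed === "function") {
  console.log(loneLead.isWellFormed());                // → false
  console.log(loneLead.toWellFormed() === "ab\uFFFD"); // → true
}
```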

Accessing Characters with Old `String.charCodeAt`, New `codePointAt`...

JavaScript's charCodeAt method gives you a code unit, not a full character code. The codePointAt method, added later, does give a full Unicode character, so we could use that to get characters from a string. But the argument passed to codePointAt is still an index into the sequence of code units. To run over all characters in a string, we'd still need to deal with the question of whether a character takes up one or two code units.
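A minimal sketch of the difference, and of one way to walk a string character by character. The helper codePoints and the sample strings are illustrative assumptions; the loop advances by two code units whenever the code point is above 0xFFFF. Iterating the string directly (for example by spreading it) handles this stepping automatically and is shown for comparison.

```javascript
const horseShoe = "🐴👟";
console.log(horseShoe.length);         // → 4
console.log(horseShoe.charCodeAt(0));  // → 55357 (first half of 🐴)
console.log(horseShoe.codePointAt(0)); // → 128052 (the whole 🐴)

// codePoints is a hypothetical helper: collect every code point in a string,
// stepping by one or two code units as needed.
function codePoints(str) {
  const result = [];
  for (let i = 0; i < str.length; ) {
    const point = str.codePointAt(i);
    result.push(point);
    i += point > 0xffff ? 2 : 1;
  }
  return result;
}

const manual = codePoints("a🐴b");
console.log(manual); // → [ 97, 128052, 98 ]

// String iteration yields whole characters, so this gives the same result.
const iterated = [..."a🐴b"].map((ch) => ch.codePointAt(0));
console.log(iterated.join() === manual.join()); // → true
```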

`toString()`

When you call the String(OBJ) function (which converts a value to a string) on an object, it will call the toString() method on that object to try to create a meaningful string from it.

Some of the standard prototypes define their own version of toString so they can create a string that contains more useful information than [object Object]. You can also do that yourself.

Rabbit.prototype.toString = function() {
  return `a ${this.type} rabbit`;
};

which you can use as follows:

console.log(String(killerRabbit));
// → a killer rabbit
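The snippets above rely on a Rabbit constructor and a killerRabbit object that are not defined in these notes. A self-contained sketch, assuming a minimal class with only a type property, might look like this:

```javascript
// Assumed minimal definition; the notes do not show the original Rabbit class.
class Rabbit {
  constructor(type) {
    this.type = type;
  }
}

// Without a custom toString, objects fall back to Object.prototype.toString.
const defaultText = String(new Rabbit("plain"));
console.log(defaultText); // → [object Object]

Rabbit.prototype.toString = function () {
  return `a ${this.type} rabbit`;
};

const killerRabbit = new Rabbit("killer");
const described = String(killerRabbit);
console.log(described); // → a killer rabbit

// Template literals perform the same conversion.
const sentence = `Beware of ${killerRabbit}!`;
console.log(sentence); // → Beware of a killer rabbit!
```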

Template Literals/Strings

Template Strings use back-ticks (``) rather than the quotes ("") to define a string:

let text = `Hello World!`;

Features:

Template Strings allow both single and double unescaped quotes inside a string:

Example:

let text = `He's often called "Johnny"`;

Multiline Strings

Template Strings allow multiline strings:

let text =
`The quick
brown fox
jumps over
the lazy dog`;

Interpolation

Template Strings provide an easy way to interpolate variables into strings.

Syntax

${...}

Example. Variable Substitutions:

let firstName = "John";
let lastName = "Doe";

let text = `Welcome ${firstName}, ${lastName}!`;

Expression Substitution

Template Strings allow interpolation of expressions in strings.

Example:

let price = 10;
let VAT = 0.25;

let total = `Total: ${(price * (1 + VAT)).toFixed(2)}`;

HTML Templates

Example:

let header = "Template Strings";
let tags = ["template strings", "javascript", "es6"];

let html = `<h2>${header}</h2><ul>`;
for (const x of tags) {
  html += `<li>${x}</li>`;
}

html += `</ul>`;

Replacing String Pieces with Other Pieces with `replace(STARTSTRING,ENDSTRING)` and `replaceAll(STARTSTRING,ENDSTRING)`

The replace() method searches a string for a value or a regular expression and returns a new string with the value(s) replaced. It does not change the original string.

let text = "Visit Microsoft!";
let result = text.replace("Microsoft", "W3Schools");

Or with a regular expression:

let text = "Mr Blue has a blue house and a blue car";
let result = text.replace(/blue/g, "red");

Note: With a string pattern, only the first occurrence is replaced. To replace all occurrences, use a regular expression with the g flag set.
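The behavior described in the note can be checked side by side. This is a small sketch using the sample sentence from above; the variable names are illustrative.

```javascript
const original = "Mr Blue has a blue house and a blue car";

// A string pattern replaces only the first occurrence.
const firstOnly = original.replace("blue", "red");
console.log(firstOnly); // → Mr Blue has a red house and a blue car

// A regular expression with the g flag replaces every occurrence.
const everyMatch = original.replace(/blue/g, "red");
console.log(everyMatch); // → Mr Blue has a red house and a red car

// Adding the i flag also catches the capitalized "Blue".
const anyCase = original.replace(/blue/gi, "red");
console.log(anyCase); // → Mr red has a red house and a red car

// The original string is never modified.
console.log(original); // → Mr Blue has a blue house and a blue car
```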

The replaceAll() method searches a string for a value or a regular expression. A new string with all values replaced is returned, while the original string is not changed. (Introduced in ECMAScript 2021.)

You can replace all occurrences of a given substring:

text = text.replaceAll("Cats","Dogs");
text = text.replaceAll("cats","dogs");

You can also replace all occurrences of a regular expression match:

text = text.replaceAll(/Cats/g,"Dogs");
text = text.replaceAll(/cats/g,"dogs");

Note: If the parameter is a regular expression, the global flag (g) must be set, otherwise a TypeError is thrown.
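A short sketch of both forms, including the TypeError mentioned in the note. The sample text and variable names are assumptions made for the example.

```javascript
const pets = "Cats chase cats";

// String pattern: every exact (case-sensitive) occurrence is replaced.
const byString = pets.replaceAll("cats", "dogs");
console.log(byString); // → Cats chase dogs

// Regular expression pattern: the g flag is required; i ignores case.
const byRegex = pets.replaceAll(/cats/gi, "dogs");
console.log(byRegex); // → dogs chase dogs

// A regular expression without the g flag is rejected.
let errorName = null;
try {
  pets.replaceAll(/cats/, "dogs");
} catch (err) {
  errorName = err.name;
}
console.log(errorName); // → TypeError
```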

`STRING.search(SUBSTRING/REGEX)` and `STRING.match(SUBSTRING/REGEX)`

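The search() method matches a string against a regular expression and returns the index (position) of the first match, or -1 if no match is found. A string argument is converted to a regular expression first, so characters such as . keep their regular-expression meaning. (The search is case sensitive unless the regular expression has the i flag.)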

An example with a string argument:

let text = "Mr. Blue has a blue house";
let position = text.search("blue");

An example with a regexp argument:

let text = "Mr. Blue has a blue house";
let position = text.search(/Blue/);
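The positions returned by the two examples above, plus a few edge cases, can be sketched as follows. The match() calls are included because the heading mentions that method; these notes do not describe it, so treat that part as a brief illustration rather than a full account. Variable names are illustrative.

```javascript
const text = "Mr. Blue has a blue house";

const lowerPos = text.search("blue");   // lowercase "blue" starts at index 15
const upperPos = text.search(/Blue/);   // capitalized "Blue" starts at index 4
const anyCase = text.search(/blue/i);   // the i flag finds "Blue" first
const missing = text.search("green");   // no match
console.log(lowerPos, upperPos, anyCase, missing); // → 15 4 4 -1

// A string argument is turned into a regular expression,
// so "." matches any character rather than a literal dot.
const dotPos = text.search(".");
console.log(dotPos); // → 0

// match() returns the matched text instead of a position.
const firstMatch = text.match(/blue/i);
console.log(firstMatch[0], firstMatch.index); // → Blue 4

// With the g flag, match() returns every matched substring.
const allMatches = text.match(/blue/gi);
console.log(allMatches.length); // → 2
```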
