Unicode: flag "u" and class \p{...}

JavaScript uses Unicode encoding (UTF-16) for strings. Most characters are encoded with two bytes, which allows representing at most 65,536 characters.

That range is not large enough to encode every possible character, so some rare characters are encoded with four bytes. The Unicode values of several characters are shown below:

Character   Unicode   Bytes count in Unicode
a           0x0061    2
≈           0x2248    2
𝒳           0x1d4b3   4
😄          0x1f604   4

So, characters such as a and ≈ occupy two bytes, while the codes for 𝒳 and 😄 are longer and take four bytes.

When JavaScript was created, Unicode encoding was simpler: no four-byte characters existed. That’s why some language features still handle them incorrectly.

For example, the length property counts each of them as two characters, as shown here:

console.log('𝒳'.length); // 2
console.log('😄'.length); // 2
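To make the difference between two-byte and four-byte characters more concrete, here is a minimal sketch. The sample characters and variable names are chosen only for illustration; it assumes a runtime that supports string iteration and codePointAt (any modern browser or Node.js).

```javascript
// Illustrative sample values, not taken from any particular application.
const twoByte = "≈";   // U+2248, fits in a single 16-bit unit
const fourByte = "😄"; // U+1F604, stored as a pair of 16-bit units

console.log(twoByte.length);  // 1
console.log(fourByte.length); // 2 (length counts 16-bit units)

// Spreading a string iterates by whole characters, so the pair counts once.
console.log([...fourByte].length); // 1

// codePointAt(0) reads the whole pair, charCodeAt(0) only its first half.
console.log(fourByte.codePointAt(0).toString(16)); // 1f604
console.log(fourByte.charCodeAt(0).toString(16));  // d83d
```

Counting with the spread syntax is one possible workaround when the number of visible characters matters more than the number of 16-bit units.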

By default, regular expressions also treat four-byte characters as a pair of two-byte ones.

But, unlike strings, regular expressions have the u flag, which fixes such problems. With it, a regexp handles four-byte characters correctly.
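A short sketch of that difference, using the dot and a character class as test patterns. The variable name smile is just an illustration.

```javascript
const smile = "😄"; // U+1F604, a four-byte character

// Without "u", the dot matches one two-byte unit, so the
// four-byte character looks like two separate characters.
console.log(/^.$/.test(smile));  // false
console.log(/^..$/.test(smile)); // true

// With "u", the dot matches the whole character.
console.log(/^.$/u.test(smile)); // true

// The same holds for character classes.
console.log(/^[😄]$/.test(smile));  // false
console.log(/^[😄]$/u.test(smile)); // true
```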

Unicode properties \p{…}

Each Unicode character has many properties. They describe what category the character belongs to and carry other information about it.

For instance, if a character has the Letter property, it belongs to an alphabet (of any language). The Number property means that it is a digit: Arabic, Chinese, and so on.

It is possible to look for characters with a given property, written as \p{…}. To use \p{…}, a regexp must have the u flag.

For example, \p{Letter} signifies a letter in any language. Also, \p{L} can be used, as L is an alias of Letter. Almost all the properties have short aliases.

The example below finds three kinds of letters: English, Georgian, and Korean:

let str = "A ბ ㄱ";
console.log(str.match(/\p{L}/gu)); // A,ბ,ㄱ
console.log(str.match(/\p{L}/g)); // null (no matches, as there isn’t any flag "u")
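As a further sketch, the long name and the short alias can be compared side by side. The snippet also uses \P{…} (capital P), which is not covered above: it is the inverse form and matches characters that do not have the property. The sample string is made up for illustration.

```javascript
const sample = "A ბ ㄱ 7!";

// \p{Letter} and its short alias \p{L} select the same characters.
const longForm = sample.match(/\p{Letter}/gu);
const shortForm = sample.match(/\p{L}/gu);
console.log(longForm);  // [ 'A', 'ბ', 'ㄱ' ]
console.log(shortForm); // [ 'A', 'ბ', 'ㄱ' ]

// \P{L} matches everything that is NOT a letter.
const nonLetters = sample.match(/\P{L}/gu);
console.log(nonLetters); // [ ' ', ' ', ' ', '7', '!' ]
```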

The main character categories and their subcategories are listed below:

Number N:

  • decimal digit Nd,
  • letter number Nl,
  • other No.

Letter L:

  • lowercase Ll,
  • modifier Lm,
  • titlecase Lt,
  • uppercase Lu,
  • other Lo.

Mark M (accents etc):

  • spacing combining Mc,
  • enclosing Me,
  • non-spacing Mn.

Punctuation P:

  • connector Pc,
  • dash Pd,
  • initial quote Pi,
  • final quote Pf,
  • open Ps,
  • close Pe,
  • other Po.

Separator Z:

  • line Zl,
  • paragraph Zp,
  • space Zs.

Symbol S:

  • currency Sc,
  • modifier Sk,
  • math Sm,
  • other So.

Other C:

  • control Cc,
  • format Cf,
  • not assigned Cn,
  • private use Co,
  • surrogate Cs.

Beyond these general categories, Unicode supports many other properties.
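The category aliases listed above can be used directly inside \p{…}. The following sketch picks a few of them; the sample sentence and variable names are made up for illustration.

```javascript
const text = "Hello, Wörld! 42 costs $5 + €3.";

// Lu: uppercase letters
const upper = text.match(/\p{Lu}/gu);
console.log(upper); // [ 'H', 'W' ]

// Nd: decimal digits
const digits = text.match(/\p{Nd}/gu);
console.log(digits); // [ '4', '2', '5', '3' ]

// Sc: currency symbols
const currency = text.match(/\p{Sc}/gu);
console.log(currency); // [ '$', '€' ]

// P: punctuation of any subcategory ("+" is a math symbol, Sm, so it is not included)
const punctuation = text.match(/\p{P}/gu);
console.log(punctuation); // [ ',', '!', '.' ]
```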

Hexadecimal Numbers: example

Now, let’s search for hexadecimal numbers.

Suppose they are written as xFF, where F stands for a hex digit (0..9 or A..F). A hex digit can be denoted as \p{Hex_Digit}, like here:

let regexp = /x\p{Hex_Digit}\p{Hex_Digit}/u;
console.log("number: xAF".match(regexp)); // xAF
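A slightly extended sketch of the same idea: it collects every two-digit hex number in a string and converts the matches to numbers. The input string and variable names are hypothetical. The last two lines rely on the fact that Hex_Digit also includes fullwidth forms, whereas the separate ASCII_Hex_Digit property does not.

```javascript
// Hypothetical input, chosen only for illustration.
const input = "colors: xAF, x1c, xZZ, x9";

// {2} repeats the hex-digit class exactly twice; "g" collects every match.
const hexByte = /x\p{Hex_Digit}{2}/gu;
const found = input.match(hexByte);
console.log(found); // [ 'xAF', 'x1c' ]

// Convert each match to a number, dropping the leading "x".
const values = found.map((s) => parseInt(s.slice(1), 16));
console.log(values); // [ 175, 28 ]

// Hex_Digit also covers fullwidth forms such as U+FF21 (fullwidth "A");
// ASCII_Hex_Digit is limited to 0-9, a-f and A-F.
console.log(/\p{Hex_Digit}/u.test("\uFF21"));       // true
console.log(/\p{ASCII_Hex_Digit}/u.test("\uFF21")); // false
```

If only plain ASCII input should be accepted, \p{ASCII_Hex_Digit} or the class [0-9a-fA-F] may be the safer choice.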

Chinese characters: example

In this part, let’s look for Chinese characters.

There is a Unicode property called Script (a writing system), which can have a value such as Han (Chinese), Greek, Arabic, and so on.

To search for characters in a particular writing system, use Script=<value>, or its short alias sc=<value>. For instance, \p{sc=Arabic} matches Arabic letters.

The example below searches for Chinese characters:

let regexp = /\p{sc=Han}/gu; // matches Chinese characters
let str = `Welcome 你好 123__456`;
console.log(str.match(regexp)); // 你,好
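The same approach works for other scripts, and adding the + quantifier returns whole runs of characters instead of single ones. A minimal sketch, with a made-up sample string:

```javascript
const mixed = "Welcome 你好 123__456 Привет αβγ";

// "+" groups adjacent characters of the same script into one match.
const han = mixed.match(/\p{sc=Han}+/gu);
console.log(han); // [ '你好' ]

// The long form Script=<value> is equivalent to sc=<value>.
const cyrillic = mixed.match(/\p{Script=Cyrillic}+/gu);
console.log(cyrillic); // [ 'Привет' ]

const greek = mixed.match(/\p{sc=Greek}+/gu);
console.log(greek); // [ 'αβγ' ]
```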

Summary

JavaScript uses Unicode encoding for strings.

The u flag enables Unicode support in regular expressions. With it, four-byte characters are handled correctly, as a single character. It also makes \p{…} available for searching by Unicode property.

Unicode properties make it possible to search for words in a particular language, special characters (quotes, currency symbols), and more.



