UTF-8 Encoding

UTF-8 (8-bit Unicode Transformation Format) is a variable-width character encoding that can encode all 1,112,064 valid code points in Unicode using one to four 8-bit bytes. The "8" in the name means the encoding works in 8-bit units: each character is represented by one or more of them.
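The number of valid code points can be checked with a little arithmetic. The sketch below assumes the usual Unicode layout, which this page does not spell out: the code space runs from 0 to 0x10FFFF, and the 2,048 surrogate values (0xD800 to 0xDFFF) are reserved for UTF-16 and cannot be encoded in UTF-8. The constant names are made up for illustration.

```python
# Assumption: Unicode code space is 0x0000..0x10FFFF, and the surrogate
# range 0xD800..0xDFFF is excluded from the set of valid code points.
TOTAL_CODE_POINTS = 0x10FFFF + 1          # 1,114,112 values in the code space
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1     # 2,048 reserved surrogate values

VALID_CODE_POINTS = TOTAL_CODE_POINTS - SURROGATE_COUNT
print(VALID_CODE_POINTS)  # 1112064
```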

Since 2009, UTF-8 has been the leading encoding for the World Wide Web.

For characters with code points at or below 127 (hex 0x7F), the UTF-8 representation is a single byte. That byte is identical to the character's ASCII value.
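A quick way to see the one-byte case is to encode a plain ASCII string both ways and compare the results. This is only an illustrative sketch using Python's built-in `str.encode`; the sample string is arbitrary.

```python
text = "Hello"

utf8_bytes = text.encode("utf-8")
ascii_bytes = text.encode("ascii")

# For code points up to 127, both encodings produce the same single byte.
print(utf8_bytes == ascii_bytes)  # True
print(list(utf8_bytes))           # [72, 101, 108, 108, 111]
```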

For any character from 128 to 2047 (hex 0x0080 to 0x07FF), the UTF-8 representation is spread over two bytes.

For any character from 2048 to 65535 (hex 0x0800 to 0xFFFF), the UTF-8 representation is spread across three bytes. Characters above 65535, up to the Unicode maximum of 1114111 (hex 0x10FFFF), take four bytes.
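The byte-length rules can be sketched as a small function, together with a hand-rolled encoder that follows the standard UTF-8 bit layout. The function names `utf8_length` and `encode_code_point` are hypothetical, chosen for this example. The four-byte branch and the rejection of surrogates (0xD800 to 0xDFFF) follow the UTF-8 definition rather than anything this page states in detail. In practice Python's built-in `str.encode("utf-8")` does all of this already.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a valid code point: {code_point!r}")
    if code_point <= 0x7F:      # 0 to 127
        return 1
    if code_point <= 0x7FF:     # 128 to 2047
        return 2
    if code_point <= 0xFFFF:    # 2048 to 65535
        return 3
    return 4                    # 65536 to 1114111


def encode_code_point(cp: int) -> bytes:
    """Encode one code point by hand (illustrative only)."""
    n = utf8_length(cp)
    if n == 1:
        return bytes([cp])
    if n == 2:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if n == 3:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare the hand-rolled encoder with Python's built-in one.
for ch in ["A", "é", "€", "😀"]:
    encoded = encode_code_point(ord(ch))
    print(ch, utf8_length(ord(ch)), encoded.hex(), encoded == ch.encode("utf-8"))
```

Each line of output should show the character, its byte count (1, 2, 3 and 4 for the four samples), the bytes in hex, and `True` when the result matches the built-in encoder.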

Here are some of the character ranges, all encodable in UTF-8, that are supported by HTML5:

Character codes                      Decimal      Hexadecimal
C0 Controls and Basic Latin          0-127        0000-007F
C1 Controls and Latin-1 Supplement   128-255      0080-00FF
Latin Extended-A                     256-383      0100-017F
Latin Extended-B                     384-591      0180-024F
Spacing Modifier Letters             688-767      02B0-02FF
Combining Diacritical Marks          768-879      0300-036F
Greek and Coptic                     880-1023     0370-03FF
Cyrillic                             1024-1279    0400-04FF
Cyrillic Supplement                  1280-1327    0500-052F
General Punctuation                  8192-8303    2000-206F
Currency Symbols                     8352-8399    20A0-20CF
Letterlike Symbols                   8448-8527    2100-214F
Arrows                               8592-8703    2190-21FF
Mathematical Operators               8704-8959    2200-22FF
Box Drawing                          9472-9599    2500-257F
Block Elements                       9600-9631    2580-259F
Geometric Shapes                     9632-9727    25A0-25FF
Miscellaneous Symbols                9728-9983    2600-26FF
Dingbats                             9984-10175   2700-27BF
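As a rough illustration of how the table can be used, the sketch below looks up which range a code point falls in and builds the decimal and hexadecimal numeric character references that HTML accepts for it. Only three rows are copied from the table, and the helper names `block_name` and `html_references` are invented for this example. `html.unescape` is from Python's standard library.

```python
import html

# A few rows copied from the table: (name, first code point, last code point).
BLOCKS = [
    ("Greek and Coptic", 0x0370, 0x03FF),
    ("Currency Symbols", 0x20A0, 0x20CF),
    ("Arrows", 0x2190, 0x21FF),
]


def block_name(code_point):
    """Return the name of the listed range containing code_point, or None."""
    for name, first, last in BLOCKS:
        if first <= code_point <= last:
            return name
    return None


def html_references(code_point):
    """Return the decimal and hexadecimal HTML references for a code point."""
    return f"&#{code_point};", f"&#x{code_point:04X};"


decimal_ref, hex_ref = html_references(0x20AC)    # the euro sign
print(block_name(0x20AC), decimal_ref, hex_ref)   # Currency Symbols &#8364; &#x20AC;
print(html.unescape(decimal_ref) == html.unescape(hex_ref) == "€")  # True
```

Either reference form can be written directly in an HTML page. A page that declares UTF-8 can also contain the character itself.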


