Unicode
Contents
A little background, or, How did we reach here
In chapter 5 of Real World Haskell, the implementation is to create a JSON library. Everything goes fine until the authors start with something called “Astral” representation of characters. The general consensus of the readers is that the authors have gone from 0 to 120 in like 2-3 code blocks (“The author shows his muscles in this chapter. Not funny at all.” someone concluded in this funny statement), and truth be told I was one of the people who had this idea.
Shielded away from the guts of how the encodings really work, because… libraries, I realised I had to take a detour in order to understand what exactly was going on in those code blocks in order to implement a library myself.
So the problem arose because JSON (and JavaScript till ES5) does not uses the now ubiquitous UTF-8 encoding to… encode characters, but something called UCS-2, which is a 2byte (16bits) fixed length encoding, which basically means that the characters are encoded in a uniform fashion and therefore the encoding can only show \(2^{16}\) or 65536 characters. If you include the characters we are using these days (Emojis 😉 and symbols âš”), then 65536 is too low a number.
Enter Unicode
Unicode is basically a mapping between a number called “code point” and
the character, e.g. U+20AC
is the code point for the Euro Symbol . The
Unicode standard provides 1,112,064 valid code points, and values from
U+0000
to U+10FFFF
, which encode all the symbols that we have around
from languages, music, maths, sciences, etc. while still leaving enough
space for more symbols that we might make up. It is because of this huge
number that we had hit a wall while using the fixed length encoding
UCS-2.
The standard divides the characters in 17 planes, each with 65534 characters. The first plane, Plane 0, is called Basic Multilingual Plane (BMP) which “contains characters for almost all modern languages, and a large number of symbols.“1. The others above it are (jokingly) called Astral Planes, and like our universe, most of these are pretty empty.
UCS-2 can show all the characters of BMP. But not the symbols from the Astrals.
The standard provides 3 ways of encoding all the characters: UTF-8, UTF-16, and UTF-32. Out of these, the first two are “variable length” and the last is fixed length (like UCS-2).
UTF-32
We can quickly end the discussion of UTF-32 by stating that each character in this encoding takes 32bits. Therefore, we have the option of displaying \(2^{32}\) characters, which is a lot more than the number of code points we really have.
The remaining two are more interesting encodings. Let’s go over them:
UTF-8
This is a variable length encoding, and 8 in the name signifies that we use 1, 2, 3, or 4 8bit bytes to encode the characters. Therefore the size of each character may range from 1byte to 4bytes. Each character has a leading bit, which is the first bit of the encoding.
0 Leading Bit
0
as the leading bit means that the encoding is storing an ASCII
character: between U+0000 to U+007F. This takes a single byte of the
form: 0xxxxxxx
. So we have 7bits for the data, enough to show \(2^7\) or
\(127\) characters in ASCII, which makes UTF-8 backwards compatible, and
is one of the reasons of its rise.
1 Leading Bit
1
as the leading bit means we are using variable number of bytes, and
the number of =1=s denote the number of bytes being used for encoding
the current character.
Since we are already using 1byte for ASCII characters, we can only use 2, 3, or 4 bytes for the next set of characters.
2 Bytes
This encodes the characters between U+0080 to U+07FF and is of the the
form: 110xxxxx 10xxxxxx
. Therefore, we have 11bits for encoding the
characters for the given range. Notice that the form has two leading 1
and total of 2 bytes.
3 Bytes
This encodes the characters between U+0800 to U+FFFF and is of the the
form: 1110xxxx 10xxxxxx 10xxxxxx
. Therefore, we have 16bits for
encoding the characters for the given range. Again, the form has three
leading 1
and total of 3 bytes.
4 Bytes
This encodes the characters between U+10000 to U+1FFFFF and is of the
the form: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
. Therefore, we have
21bits for encoding the characters for the given range. Four leading 1
and total of 4 bytes.
As an example, this is how the wink character 😉 (U+1F609) is encoded:
- The code point is 1F609, which states that this will use 4 bytes.
- Convert this to binary, we get:
11111011000001001
which has a length of 17. - Since we have 21bits to show data, we prepend 4 zeros in the front to
make it 21bits:
000011111011000001001
- We put the bits in the template for 4byte encoding:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
where each x is replaced by the corresponding bit of binary000011111011000001001
- End result (with spaces added for clarity) is:
11110 000:10 011111:10 011000:10 001001
UTF-16
This is the second variable length encoding option of the Unicode standard, and as the name states, it uses 16bit words (either 1 or 2) to encode the character.
For the first 65536 characters (U+0000
to U+D7FF
and U+E000
to
U+FFFF
), which are enough to show the characters from BMP we use the
first byte. In this regard, this is exactly the same as UCS-2. Had the
story ended here, we would not have that astral code block in the book.
But it turns out that now days we are interested in a lot more symbols âš”
and Emojis 😉. Therefore we need to show the characters which have the
values between U+10000
to U+10FFFF
.
If you notice above, we have a gap between U+D800
to U+DFFF
: a gap
of 2 bytes. This gap is used to create surrogate pairs, where the
first byte (values: U+D800
to U+DBFF
) and second byte (values:
U+DCOO
to U+DFFF
) are called High and Low surrogate respectively.
The unique combination of two bytes allows us to have 1024 x 1024 =
1,048,567 more values. Good enough.
The thing about surrogate pairs is that the standard forbids the use of
these values for anything other than encoding the values between
U+10000
to U+10FFFF
so we can be assured that there will not be any
clashes.
The algorithm for converting these values is:
- 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
- The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
- The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
Once this theory is clear, I came to understand the following astral function in the book:
|
|
Phew! A trip down the Rabbit Hole.
Author Tushar Tyagi
LastMod Apr 11, 2016