During my years as an engineering leader, I've worked with developers whose mental model of Unicode does not go beyond "they're special characters". This sometimes leads them to create bugs, and then struggle to fix them, because they confuse bytes, UTF-8, and Unicode code points.
Rather than thinking of Unicode as "special characters," it may help some developers to think about Unicode as a number system created by an alien race . . . with a lot of fingers.
Number systems have a radix
The most familiar radix is "base 10", the plain old "normal" numbers used around the world today. The 10 corresponds to the ten possible digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and probably has its origins in the ten anatomical digits on our hands.
Most people have also heard that modern computers are based on binary or "base 2", where there are only 2 possible digits: 0, 1. Other radices that are common in computing include:
- hexadecimal (base 16) with digits: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F
- Base 64 with digits: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, /
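To make the idea of a radix concrete, here is a minimal Python sketch that writes the same number using different sets of digits. The function name `to_radix` and the digit constants are made up for this illustration; they are not part of any library.

```python
def to_radix(n: int, digits: str) -> str:
    """Write the non-negative integer n using the given digits.

    The radix is simply the number of available digits.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    radix = len(digits)
    if n == 0:
        return digits[0]
    out = []
    while n > 0:
        # Peel off the lowest digit, then continue with what is left.
        n, remainder = divmod(n, radix)
        out.append(digits[remainder])
    return "".join(reversed(out))


BINARY = "01"
DECIMAL = "0123456789"
HEXADECIMAL = "0123456789ABCDEF"

print(to_radix(255, BINARY))       # 11111111
print(to_radix(255, DECIMAL))      # 255
print(to_radix(255, HEXADECIMAL))  # FF
```

The number itself never changes. Only the digits used to write it down do.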
Computers know about "bytes"
Even though computers are based on binary at the lowest levels, most computers from the 1960s until the present (2010s) understand bytes. Bytes are so commonly 8 bits that the word is often used interchangeably with "octet", the term that strictly means a group of 8 bits.
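Since a byte can be thought of as a base 2 number, one thing we can do is convert it to a different radix. Converting the byte 01101101 (base 2) into base 10 gives the three digit number 109. Converting it into base 16 gives the two digit number 6D. Converting it into base 64 with the digits listed above gives the two digit number Bt. (The Base64 encoding of this byte is bQ rather than Bt, because Base64 first pads the 8 bits with zeros on the right to fill two 6-bit digits.) These are all the same number represented in different ways.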
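The byte conversions above can be checked with a few lines of Python. This is only a sketch using built-in functions and the standard `base64` module. Note that the standard Base64 encoder pads the 8 bits with zeros on the right and appends `=` padding characters, which is why its output for this byte is `bQ==`.

```python
import base64

byte_value = int("01101101", 2)   # read the bits as a base 2 number

print(byte_value)                 # 109  (base 10)
print(format(byte_value, "X"))    # 6D   (base 16)
print(format(byte_value, "08b"))  # 01101101 (back to base 2)

# Base64 encoding of the single byte, including '=' padding.
encoded = base64.b64encode(bytes([byte_value])).decode("ascii")
print(encoded)                    # bQ==
```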
A byte is a base 256 digit
Why stop at 64? We could also convert a byte into a base 256 number.
The total number of possible bytes is 2^8 = 256.
The total number of digits in a base 256 number system is 256.
A byte will always be a single base 256 digit, because there are exactly as many digits as there are different bytes. But we can't keep using letters and numbers and punctuation, because there aren't 256 of them. After we exhaust those, we need to draw symbols for digits from other alphabets, or just make up little pictures for them like snowmen, hearts and telephones:
|Base 10||Our Base 256||Base 10||Our Base 256||Base 10||Our Base 256||Base 10||Our Base 256|
|34||i||109||γ||252||☃||255||♣|
Using these digits, converting our byte 01101101 (base 2) into base 256 gives the one digit number γ (base 256).
Maybe it doesn't look like a digit because we don't normally think in base 256, but it's a completely valid one digit number.
A sequence of bytes is a base 256 number
If a single byte is one base 256 digit, what would a larger number look like?
It would look like multiple bytes:
These three bytes, 00100010 11111111 11111100 (base 2), are the same as the three digit number i♣☃ (base 256), or the seven digit number 2293756 (base 10).
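Python's built-in `int.from_bytes` treats a sequence of bytes as exactly this kind of base 256 number, so the three-byte example can be checked directly. The sketch below also repeats the calculation by hand, one base 256 digit at a time. The variable names are only for illustration.

```python
three_bytes = bytes([0b00100010, 0b11111111, 0b11111100])

# Most significant byte first ("big endian"), like reading digits left to right.
value = int.from_bytes(three_bytes, byteorder="big")
print(value)  # 2293756

# The same calculation by hand: shift one base 256 place, then add the next digit.
manual = 0
for digit in three_bytes:
    manual = manual * 256 + digit
print(manual == value)  # True

# And back again: the number written as three base 256 digits.
print(value.to_bytes(3, byteorder="big") == three_bytes)  # True
```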
ASCII is Base 128
There's another ubiquitous radix in computers -- base 128. The standard set of "digits" for this base is defined by a standard called ASCII, but unfortunately for us humans, only 94 out of the 128 digits are actually possible to write down on a piece of paper, so we can't use ASCII to write every number in base 128.
However, we can write some of the numbers and that turns out to be pretty useful.
As you read this, your computer is likely doing something similar to interpreting this sentence as a 130 digit (base 128) number.
Let's take a few examples:
- The two digit number f1 (base 128) is the same as the 5 digit number 13105 (base 10) or the 14 digit number 11001100110001 (base 2)
- The seven digit number wtanaka (base 128) is the same as the 15 digit number 527379535001057 (base 10) or the 49 digit number 1110111111010011000011101110110000111010111100001 (base 2)
- The two digit number 51 (base 128) is the same as the 4 digit number 6833 (base 10) or the 13 digit number 1101010110001 (base 2)
Even though 527379535001057 (base 10) is not meaningful for humans, when we convert it into base 128 and use ASCII for digits, we get "wtanaka" (base 128), which looks suspiciously like text.
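The examples above can be reproduced with a short sketch that reads ASCII text as a base 128 number and converts it back. The function names `ascii_as_number` and `number_as_ascii` are hypothetical helpers written for this post, not library functions.

```python
def ascii_as_number(text: str) -> int:
    """Interpret ASCII text as a base 128 number, one character per digit."""
    value = 0
    for ch in text:
        digit = ord(ch)
        if digit >= 128:
            raise ValueError(f"{ch!r} is not an ASCII character")
        value = value * 128 + digit
    return value


def number_as_ascii(value: int) -> str:
    """Write a positive integer in base 128, using ASCII characters as digits."""
    chars = []
    while value > 0:
        value, digit = divmod(value, 128)
        chars.append(chr(digit))
    return "".join(reversed(chars))


print(ascii_as_number("f1"))             # 13105
print(ascii_as_number("wtanaka"))        # 527379535001057
print(ascii_as_number("51"))             # 6833
print(number_as_ascii(527379535001057))  # wtanaka
```

Note that "51" (base 128) is 6833 (base 10), not 51. The characters "5" and "1" are digits number 53 and 49 in ASCII.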
Character encodings are ways to represent any radix using bytes
Ultimately, computers deal most efficiently with bytes, so if we have numbers in a radix other than base 256, we need to figure out how to store those numbers using bytes.
To store our ASCII number wtanaka (base 128) in bytes, one approach is to take the 49 digit number 1110111111010011000011101110110000111010111100001 (base 2) and split it into 6 groups of 8 and 1 group of 1, like this:
for a total of 7 bytes. However, the approach that computers take is to waste the first 1/8 of each byte by setting it to 0, so that each byte corresponds to a single base 128 "digit":
01110111 01110100 01100001 01101110 01100001 01101011 01100001
w        t        a        n        a        k        a
Each of these base 128 digits is commonly known as a character, and these different strategies for representing a non base-256 number using bytes are examples of different character encodings.
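Here is a small sketch of the two storage strategies described above, using Python's built-in `ascii` codec for the one-digit-per-byte approach. The packed variant is only computed as a string of bits for comparison. It is an illustration, not an encoding that any standard defines.

```python
text = "wtanaka"

# Strategy used in practice: one byte per base 128 digit, top bit always 0.
one_per_byte = text.encode("ascii")
for ch, b in zip(text, one_per_byte):
    print(ch, format(b, "08b"))

# Alternative: write the 7-bit digits back to back, then cut into 8-bit groups.
bits = "".join(format(b, "07b") for b in one_per_byte)
packed_byte_count = (len(bits) + 7) // 8  # round up to whole bytes

print(len(bits), "bits,", packed_byte_count, "bytes when packed")
print(len(one_per_byte), "bytes when stored one digit per byte")
```

For a 7 character string both strategies happen to need 7 bytes. The one-digit-per-byte layout has the advantage that the nth character is always the nth byte.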
Unicode is Base 1114112
ASCII has 128 different characters (some of which are not printable), one for each of its 128 digits. However, there are more than 128 different characters in all of the languages of the world, so to extend the approach of equating characters to digits, we need many many many more digits, i.e. a much higher radix. Unicode has, as of this writing, settled on a radix of base 1114112. Just like with other lower radices, there is a table with all of the digits in it:
|base 10||Unicode aka base 1114112|
Unicode has decided to call these base 1114112 digits "code points" instead of characters because, like with ASCII, many of them are not printable. We'll just continue calling them Unicode characters for the rest of this post. Just like with ASCII, in order to store a Unicode character in a computer, we need to figure out how to represent it using bytes.
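In Python 3, the built-in `ord` and `chr` functions convert between a character and its position in the big table of digits, and `sys.maxunicode` holds the largest code point. The sketch below uses them to check the radix. The snowman is written as the escape `\u2603` so the example does not depend on how this page itself is encoded.

```python
import sys

# The largest code point is 1114111, so there are 1114112 digits in total.
radix = sys.maxunicode + 1
print(radix)           # 1114112

# From character to its digit value (code point) in base 10 ...
print(ord("w"))        # 119
print(ord("\u2603"))   # 9731 (the snowman)

# ... and from digit value back to character.
print(chr(9731))
```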
UTF-32BE character encoding
Luckily, there are several standard character encodings for this. The simplest is called UTF-32BE which uses 4 bytes to store a single character. For example, to encode the 7 Unicode characters wtanaka, we look them up in the large table of characters:
|Unicode (base 1114112)||base 2|
|w||1110111|
|t||1110100|
|a||1100001|
|n||1101110|
|a||1100001|
|k||1101011|
|a||1100001|
and represent each character with 4 bytes:
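Python ships a `utf-32-be` codec, so the 4-bytes-per-character layout can be printed directly. This sketch shows each character of the example next to its four bytes, and checks that those bytes are just the code point written as a 4 digit base 256 number.

```python
text = "wtanaka"
encoded = text.encode("utf-32-be")
print(len(encoded))  # 28, i.e. 7 characters * 4 bytes

for i, ch in enumerate(text):
    chunk = encoded[4 * i : 4 * i + 4]
    # Each chunk is the code point padded with leading zero bytes.
    assert chunk == ord(ch).to_bytes(4, byteorder="big")
    print(ch, " ".join(format(b, "08b") for b in chunk))
```

The first line printed for a character looks like `w 00000000 00000000 00000000 01110111`: three bytes of zeros followed by the same byte ASCII would use.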
UTF-8 character encoding
UTF-32 wastes a lot of space, so by far the most commonly used encoding is a more complicated one called UTF-8. Not by coincidence, for ASCII characters, the representation ends up being identical to the ASCII representation:
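A quick way to convince yourself of this, sketched with Python's built-in codecs, is to encode the same ASCII-only text three ways and compare the results:

```python
text = "wtanaka"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf32_bytes = text.encode("utf-32-be")

print(utf8_bytes == ascii_bytes)          # True: byte for byte identical
print(len(utf8_bytes), len(utf32_bytes))  # 7 28
```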
For larger characters, UTF-8 uses either 2, 3 or 4 bytes to represent a single character. For example:
|base 10||Unicode aka base 1114112||base 2|
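To see the variable length in action, the following sketch encodes four characters with Python's built-in `utf-8` codec and prints the code point, the number of bytes, and the bits. The four sample characters are my own picks for illustration (written as escapes so the example does not depend on this page's encoding): an ASCII letter, an accented letter, the snowman, and an emoji.

```python
samples = ["w", "\u00e9", "\u2603", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    bits = " ".join(format(b, "08b") for b in encoded)
    # code point (base 10), character, byte count, bytes in base 2
    print(ord(ch), ch, len(encoded), bits)

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```

In the multi-byte cases, the first byte starts with as many 1 bits as there are bytes in the sequence, and every following byte starts with 10.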