UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages[1], and other places where characters are stored or streamed.
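UTF-8 encodes each character in one to four octets (8-bit bytes): one byte for the 128 US-ASCII characters, two bytes for the next 1920 code points (U+0080 to U+07FF), three bytes for the rest of the Basic Multilingual Plane, and four bytes for code points in the other planes.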
Four bytes may seem like a lot for one character (code point). However, code points outside the Basic Multilingual Plane are generally very rare. Furthermore, UTF-16 (the main alternative to UTF-8) also needs four bytes for these code points. Whether UTF-8 or UTF-16 is more efficient depends on the range of code points being used. However, the differences between different encoding schemes can become negligible with the use of traditional compression systems like DEFLATE. For short items of text where traditional algorithms do not perform well and size is important, the Standard Compression Scheme for Unicode could be considered instead.
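As a rough illustration of how the more efficient encoding depends on the code point range, the following sketch compares encoded sizes for a few sample strings. The helper name `encoded_sizes` and the sample strings are chosen here for illustration only; UTF-16 is measured without a byte order mark.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


ascii_text = "hello"                 # ASCII: 1 byte each in UTF-8, 2 in UTF-16
cjk_text = "\u65e5\u672c\u8a9e"      # three BMP ideographs: 3 bytes each in UTF-8, 2 in UTF-16
astral_text = "\U0001F600"           # outside the BMP: 4 bytes in both encodings

for sample in (ascii_text, cjk_text, astral_text):
    print(repr(sample), encoded_sizes(sample))
```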
The Internet Engineering Task Force (IETF) requires all Internet protocols to identify the encoding used for character data, and the supported character encodings must include UTF-8.[2] The Internet Mail Consortium (IMC) recommends that all email programs be able to display and create mail using UTF-8.[3]
By early 1992 a search was on for a good byte-stream encoding of multi-byte character sets. The draft ISO 10646 standard contained a non-required annex called UTF that provided a byte-stream encoding of its 32-bit characters. This encoding was not satisfactory on performance grounds, but did introduce the notion that bytes in the ASCII range of 0–127 represent themselves in UTF, thereby providing backward compatibility.
In July 1992 the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multibyte sequences would include only 8-bit characters, i.e. those where the high bit was set.
In August 1992 this proposal was circulated by an IBM X/Open representative to interested parties. Ken Thompson of the Plan 9 operating system group at Bell Laboratories then made a crucial modification to the encoding, to allow it to be self-synchronizing, meaning that it was not necessary to read from the beginning of the string in order to find character boundaries. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. The following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open.[4]
UTF-8 was first officially presented at the USENIX conference in San Diego, held January 25–29, 1993.
There are several current definitions of UTF-8 in various standards documents:
They supersede the definitions given in the following obsolete works:
They are all the same in their general mechanics with the main differences being on issues such as allowed range of code point values and safe handling of invalid input.
The bits of a Unicode character are divided into several groups which are then divided among the lower bit positions inside the UTF-8 bytes. A character whose code point is below U+0080 is encoded with a single byte that contains its code point: these correspond exactly to the 128 characters of 7-bit ASCII. In other cases, up to four bytes are required. The most significant bit of these bytes is 1, to prevent confusion with 7-bit ASCII characters and therefore keep standard byte-oriented string processing safe.
| Code range (hexadecimal) | Scalar value (binary) | UTF-8 (binary / hexadecimal) | Notes |
|---|---|---|---|
| 000000–00007F (128 codes) | 00000000 00000000 0zzzzzzz (seven z) | 0zzzzzzz (seven z; byte value 00–7F) | ASCII equivalence range; byte begins with zero |
| 000080–0007FF (1920 codes) | 00000000 00000yyy yyzzzzzz (three y; two y, six z) | 110yyyyy 10zzzzzz (five y, six z; byte values C2–DF and 80–BF) | First byte begins with 110, the following byte begins with 10 |
| 000800–00D7FF, 00E000–00FFFF | 00000000 xxxxyyyy yyzzzzzz (four x, four y; two y, six z) | 1110xxxx 10yyyyyy 10zzzzzz (four x, six y, six z; byte values E0–EF, then two bytes 80–BF) | First byte begins with 1110, the following two bytes begin with 10 |
| 010000–10FFFF (1048576 codes) | 000wwwxx xxxxyyyy yyzzzzzz (three w, two x; four x, four y; two y, six z) | 11110www 10xxxxxx 10yyyyyy 10zzzzzz (three w; six x; six y; six z; byte values F0–F4, then three bytes 80–BF) | First byte begins with 11110, the following three bytes begin with 10 |
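For example, the character aleph (א), which is Unicode U+05D0, is encoded into UTF-8 in this way:
1. This character is in the range U+0080 to U+07FF. The table shows it will be encoded using two bytes, 110yyyyy 10zzzzzz.
2. Hexadecimal 0x05D0 is equivalent to binary 101-1101-0000.
3. The eleven bits are put in order into the positions marked by "y"-s and "z"-s: 11010111 10010000.
4. The final result is the two bytes, more conveniently expressed as the two hexadecimal bytes 0xD7 0x90. That is the encoding of the character aleph (א) in UTF-8.

Another example: when the number of bits to be filled is less than the maximum number of free bits available, the high bits are padded with 0's.
For example, the Cent Sign (¢), which is Unicode U+00A2, is encoded into UTF-8 in this way:
1. This character is in the range U+0080 to U+07FF. The table shows it will be encoded using two bytes, 110yyyyy 10zzzzzz.
2. Hexadecimal 0x00A2 is equivalent to binary 1010-0010, which is padded on the left to eleven bits: 000-1010-0010.
3. The eleven bits are put in order into the positions marked by "y"-s and "z"-s: 11000010 10100010.
4. The final result is the two bytes, more conveniently expressed as the two hexadecimal bytes 0xC2 0xA2. That is the encoding of the character Cent Sign (¢) in UTF-8.

Width by first byte: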
| Binary | Hexadecimal | Decimal | Width |
|---|---|---|---|
| 00000000–01111111 | 00–7F | 0–127 | 1 byte |
| 11000010–11011111 | C2–DF | 194–223 | 2 bytes |
| 11100000–11101111 | E0–EF | 224–239 | 3 bytes |
| 11110000–11110100 | F0–F4 | 240–244 | 4 bytes |
So the first 128 characters (US-ASCII) need one byte. The next 1920 characters need two bytes to encode. This includes Latin alphabet characters with diacritics, Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the BMP characters use three bytes, and additional characters are encoded in four bytes.
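The bit distribution described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `encode_code_point` is chosen here for the example, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode scalar value using the 1- to 4-byte UTF-8 layouts."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:
        # 0zzzzzzz
        return bytes([cp])
    if cp < 0x800:
        # 110yyyyy 10zzzzzz
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(encode_code_point(0x05D0).hex())  # aleph
print(encode_code_point(0x00A2).hex())  # cent sign
```

For aleph and the cent sign, the sketch yields the same bytes as the worked examples (D7 90 and C2 A2).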
By continuing the pattern given above it is possible to deal with much larger numbers. The original specification allowed for sequences of up to six bytes covering numbers up to 31 bits (the original limit of the universal character set). However, in November 2003 RFC 3629 restricted UTF-8 to the area covered by the formal Unicode definition, U+0000 to U+10FFFF. With these restrictions, the following byte values never appear in a legal UTF-8 sequence:
| Codes (binary) | Codes (hexadecimal) | Notes |
|---|---|---|
| 1100000x | C0, C1 | Overlong encoding: lead byte of a 2-byte sequence, but code point <= 127 |
| 11110101, 1111011x | F5, F6, F7 | Restricted by RFC 3629: lead byte of a 4-byte sequence for a code point above 10FFFF |
| 111110xx, 1111110x | F8, F9, FA, FB, FC, FD | Restricted by RFC 3629: lead byte of a sequence 5 or 6 bytes long |
| 1111111x | FE, FF | Invalid: lead byte of a sequence 7 or 8 bytes long |
While the two categories labeled "Restricted by RFC" above were technically allowed by earlier UTF-8 specifications, no characters were ever assigned to the code points they represent, so they should never have appeared in UTF-8-encoded text.
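The width-by-first-byte table and the list of byte values that never appear can be combined into a small classifier. The following is a sketch under the RFC 3629 restrictions described above; the name `sequence_length` is an illustrative choice, and the function inspects only the lead byte, not the bytes that follow it.

```python
from typing import Optional


def sequence_length(lead: int) -> Optional[int]:
    """Return the sequence width implied by a lead byte, or None if the
    byte can never start a legal UTF-8 sequence."""
    if 0x00 <= lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    # 80-BF are continuation bytes; C0, C1 and F5-FF never appear at all.
    return None


for byte in (0x54, 0xC1, 0xD7, 0xE3, 0xF4, 0xF5, 0xFF):
    print(f"{byte:02X}", sequence_length(byte))
```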
Many Windows programs (including Windows Notepad) use the byte sequence EF BB BF at the beginning of a file to indicate that the file is encoded using UTF-8. This is the Byte Order Mark U+FEFF encoded in UTF-8, which appears as the ISO-8859-1 characters "ï»¿" in most text editors and web browsers not prepared to handle UTF-8.
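A program reading such files may want to drop the mark before processing the text. A minimal sketch in Python, using the standard `codecs.BOM_UTF8` constant and the `utf-8-sig` codec (the helper name `decode_dropping_bom` is illustrative):

```python
import codecs


def decode_dropping_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, silently removing a leading EF BB BF if present."""
    return data.decode("utf-8-sig")


data = codecs.BOM_UTF8 + "hello".encode("utf-8")

print(data[:3].hex())                    # the three BOM bytes
print(repr(data.decode("utf-8")))        # plain UTF-8 decoding keeps U+FEFF
print(repr(decode_dropping_bom(data)))   # the mark is removed
print(data.decode("iso-8859-1")[:3])     # what a non-UTF-8 viewer shows
```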
In normal usage, the Java programming language supports standard UTF-8 when reading and writing strings through InputStreamReader and OutputStreamWriter.
However, Java also supports a non-standard variant of UTF-8 called modified UTF-8 for object serialization, for the Java Native Interface, and for embedding constants in class files. There are two differences between modified and standard UTF-8.
The first difference is that the null character (U+0000) is encoded as 0xc0 0x80 rather than 0x00. (0xc0 0x80 is not legal standard UTF-8 because it is not the shortest possible representation.) This guarantees that if an extra null terminator byte 0x00 is placed at the end of the string, it will be the only 0x00 encountered if a string containing embedded null characters is processed in a language such as C using traditional ASCIIZ string functions. In standard UTF-8 the embedded nulls would be encoded as 0x00, signalling the end of the string and causing premature truncation.
The second difference is in the way characters outside the Basic Multilingual Plane are encoded. In standard UTF-8 these characters are encoded using the four-byte format above. In modified UTF-8 these characters are first represented as surrogate pairs (as in UTF-16), and then the surrogate pairs are encoded individually in sequence as in CESU-8, taking up 6 bytes in total. Each Java character represents a 16-bit value. This aspect of the language predates the supplementary planes of Unicode; however, it is important for performance as well as backwards compatibility, and is unlikely to change.
Because modified UTF-8 is not UTF-8, one needs to be very careful to avoid mislabelling data in modified UTF-8 as UTF-8 when interchanging information over the Internet.
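The two differences can be illustrated with a short Python sketch. This is an approximation written for this article, not Java's own implementation; the function name `encode_modified_utf8` is hypothetical, and the sketch assumes the input contains no unpaired surrogates.

```python
def encode_modified_utf8(text: str) -> bytes:
    """Encode text the way modified UTF-8 is described above."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # First difference: U+0000 becomes the overlong pair C0 80.
            out += b"\xc0\x80"
        elif cp >= 0x10000:
            # Second difference: split into a UTF-16 surrogate pair and
            # encode each surrogate as its own three-byte sequence.
            offset = cp - 0x10000
            high = 0xD800 | (offset >> 10)
            low = 0xDC00 | (offset & 0x3FF)
            for surrogate in (high, low):
                out += chr(surrogate).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)


print(encode_modified_utf8("a\x00b").hex())
print(encode_modified_utf8("\U0001F600").hex())   # 6 bytes
print("\U0001F600".encode("utf-8").hex())         # 4 bytes in standard UTF-8
```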
Tcl uses the same modified UTF-8 as Java for internal representation of Unicode data.
The Mac OS X Operating System uses canonically decomposed Unicode, encoded using UTF-8 for file names in the filesystem. This is sometimes referred to as UTF-8-MAC. In canonically decomposed Unicode, the use of precomposed characters is forbidden and combining diacritics must be used to replace them.
A common argument[citation needed] is that decomposition makes sorting far simpler, but this argument is easily refuted[citation needed]: sorting is language dependent (in German, the ä character sorts just after the a character, while in Swedish ä sorts after z). Decomposed text can also be confusing for software built around the assumption that precomposed characters are the norm and combining diacritics are only used to form unusual combinations. This is an example of the NFD variant of Unicode normalization. Most other platforms, including Windows and Linux, use the NFC form of Unicode normalization, which is also used by W3C standards, so NFD data must typically be converted to NFC for use on other platforms or the Web.
This is discussed in Apple Q&A 1173.[5]
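The difference between the two normalization forms is visible in the encoded bytes. The following sketch uses Python's standard `unicodedata` module with ä as the sample character; it shows generic NFC and NFD only and does not attempt to reproduce any filesystem-specific details.

```python
import unicodedata

precomposed = "\u00e4"  # ä as a single precomposed code point

nfc = unicodedata.normalize("NFC", precomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(len(nfc), nfc.encode("utf-8").hex())  # one code point
print(len(nfd), nfd.encode("utf-8").hex())  # 'a' followed by combining diaeresis U+0308

# The two strings look identical on screen but compare unequal until
# they are brought to the same normalization form.
print(nfc == nfd)
print(unicodedata.normalize("NFC", nfd) == nfc)
```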
Oracle databases use CESU-8. Characters outside the BMP are first encoded as surrogate pairs, which are then each encoded as UTF-8. It is the same as modified UTF-8 from Java, but without the special encoding of the NUL character. It is not valid UTF-8.
As a consequence of the design of UTF-8, the following properties of multi-byte sequences hold:
- The most significant bit of a single-byte character is always 0.
- The most significant bits of the first byte of a multi-byte sequence determine the length of the sequence. These most significant bits are 110 for two-byte sequences; 1110 for three-byte sequences, and so on.
- The remaining bytes in a multi-byte sequence have 10 as their two most significant bits.

UTF-8 was designed to satisfy these properties in order to guarantee that no byte sequence of one character is contained within a longer byte sequence of another character. This ensures that byte-wise sub-string matching can be applied to search for words or phrases within a text; some older variable-length 8-bit encodings (such as Shift-JIS) did not have this property and thus made string-matching algorithms rather complicated. Although this property adds redundancy to UTF-8–encoded text, the advantages outweigh this concern; besides, data compression is not one of Unicode's aims and must be considered independently. This also means that if one or more complete bytes are lost due to error or corruption, one can resynchronize at the beginning of the next character and thus limit the damage.
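Resynchronization follows directly from these properties: starting at any byte offset, a reader only has to skip bytes whose top two bits are 10. A minimal sketch (the function name `next_char_start` is illustrative):

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that is not a
    continuation byte (10xxxxxx), i.e. where a character can begin."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "a\u05d0b".encode("utf-8")   # bytes: 61 D7 90 62

print(next_char_start(data, 1))  # offset 1 is the lead byte D7
print(next_char_start(data, 2))  # offset 2 is the continuation byte 90, so skip ahead
```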
Also due to the design of the byte sequences, if a sequence of bytes supposed to represent text validates as UTF-8 then it is fairly safe to assume it is UTF-8. The chance of a random sequence of bytes being valid UTF-8 and not pure ASCII is 3.1% for a 2 byte sequence, 0.39% for a 3 byte sequence and even lower for longer sequences.
While natural languages encoded in traditional encodings are not random byte sequences, they are also unlikely to produce byte sequences that would pass a UTF-8 validity test and then be misinterpreted. For example, for ISO-8859-1 text to be misrecognized as UTF-8, the only non-ASCII characters in it would have to be in sequences starting with either an accented letter or the multiplication symbol and ending with a symbol. Pure ASCII text would pass a UTF-8 validity test and it would be interpreted correctly because the UTF-8 encoding for the same text is the same as the ASCII encoding.
The bit patterns can be used to identify UTF-8 characters. If the first hex digit of a byte is 0–7, the byte is an ASCII character. If it is C or D, the byte starts a two-byte sequence carrying an 11-bit value. If it is E, the byte starts a three-byte sequence carrying 16 bits, and if it is F, a four-byte sequence carrying 21 bits. A first byte cannot have 8 through B as its first hex digit, but every following byte in a sequence must. Thus, at a glance, it can be seen that "0xA9" is not a valid UTF-8 character, but that "0x54" and "0xE3 0xB4 0xB1" are valid UTF-8 characters.
There is no good validity test for traditional 8-bit encodings like ISO-8859-1. The encoding in use must be known by some other means; otherwise, garbled text will be shown, a failure known as mojibake, among other names. The fact that there is a working validity test for UTF-8-encoded text is a big advantage.
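A validity test of this kind can be sketched with Python's built-in strict UTF-8 decoder, which raises `UnicodeDecodeError` on ill-formed input. The helper name `is_valid_utf8` is an illustrative choice; the last lines show the kind of garbling that results when UTF-8 bytes are read as ISO-8859-1, which accepts any byte sequence.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"\xa9"))          # lone continuation byte
print(is_valid_utf8(b"\x54"))          # ASCII
print(is_valid_utf8(b"\xe3\xb4\xb1"))  # three-byte sequence

# ISO-8859-1 has no such test: the UTF-8 bytes for U+00E9 decode
# "successfully" into two unrelated characters.
garbled = "\u00e9".encode("utf-8").decode("iso-8859-1")
print(garbled)
```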
The exact response required of a UTF-8 decoder on invalid input is not uniformly defined by the standards. In general, there are several ways a UTF-8 decoder might behave in the event of an invalid byte sequence:
It is possible for a decoder to behave in different ways for different types of invalid input.
RFC 3629 states that "Implementations of the decoding algorithm MUST protect against decoding invalid sequences."[6] The Unicode Standard requires a Unicode-compliant decoder to "…treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."
Overlong forms are one of the most troublesome types of UTF-8 data. The current RFC says they must not be decoded, but older specifications for UTF-8 only gave a warning, and many simpler decoders will happily decode them. Overlong forms have been used to bypass security validations in high profile products including Microsoft's IIS web server. Therefore, great care must be taken to avoid security issues if validation is performed before conversion from UTF-8, and it is generally much simpler to handle overlong forms before any input validation is done.
Another common problem is decoders that do not check that the trailing bytes are really trailing bytes. This will cause more characters to be lost than necessary if some bytes are lost or corrupted.
To maintain security in the case of invalid input, there are a few options. The first is to decode the UTF-8 before doing any input validation checks. The second is to use a decoder that, in the event of invalid input, either returns an error or returns text that the application knows to be harmless. A third possibility is to not decode the UTF-8 at all. This is quite practical if the system only treats some ASCII characters (like slash and NUL) specially and treats all other bytes as identifiers or other data, but it requires care to avoid passing invalid UTF-8 to other code (such as third-party libraries or an operating system) that cannot safely handle it.
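The overlong-form problem can be made concrete with a small sketch. The bytes C0 AF are an overlong two-byte form of the slash character; the deliberately unsafe `naive_decode_two_byte` below (a hypothetical function written for this illustration) decodes them to "/", so a check for "/" performed on the raw bytes would be bypassed. Python's built-in decoder is used as the example of a strict decoder that rejects the sequence.

```python
from typing import Optional


def naive_decode_two_byte(data: bytes) -> str:
    """Deliberately unsafe: applies the two-byte bit layout without
    rejecting overlong forms or checking the trailing byte."""
    lead, trail = data
    return chr(((lead & 0x1F) << 6) | (trail & 0x3F))


def strict_decode(data: bytes) -> Optional[str]:
    """Return the decoded text, or None if the input is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


overlong_slash = b"\xc0\xaf"

print(b"/" in overlong_slash)                 # a byte-level filter sees no slash
print(naive_decode_two_byte(overlong_slash))  # but a lax decoder produces one
print(strict_decode(overlong_slash))          # a strict decoder refuses the input

# Replacing invalid input with U+FFFD is one way to return harmless text.
print("/" in overlong_slash.decode("utf-8", errors="replace"))
```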
A common criticism from beginners of variable-length encoding such as UTF-8 is that the algorithms to find the number of characters between two points, or the point that is n characters after another point, are not O(1) (constant time), causing programs using them to be slower. However the use of these algorithms by actual working software is often over-estimated:
So while the number of octets in a UTF-8 string or substring is related in a more complex way to the number of code points than for UTF-32, it is very rare to encounter a situation where this makes a difference in practice, and this cannot be used as either an advantage or disadvantage of UTF-8.
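For reference, the linear-time algorithm in question is short: because every character contributes exactly one byte that is not a continuation byte, code points can be counted in a single pass over the bytes. A minimal sketch, assuming the input is already known to be valid UTF-8 (the name `count_code_points` is illustrative):

```python
def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by counting the bytes that are
    not continuation bytes (10xxxxxx). Runs in O(n) over the bytes."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


data = "a\u05d0b\u00a2".encode("utf-8")
print(len(data), count_code_points(data))  # number of octets vs. number of code points
```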
All text is available under the terms of the GNU Free Documentation License
This page is cache of Wikipedia. History