Punycode

Punycode, defined in RFC 3492, is an instance of the general "Bootstring" algorithm that encodes Unicode strings into the limited character set supported by the Domain Name System. The encoding is used as part of IDNA, a system that enables internationalized domain names in all languages supported by Unicode and places the burden of translation entirely on the user application (a web browser, for example).

The encoding is applied separately to each component (label) of a domain name that is not represented solely within the ASCII character set, and the reserved prefix 'xn--' is added to the translated Punycode string. For example, bücher becomes bcher-kva in Punycode, and therefore the domain name bücher.ch is represented as xn--bcher-kva.ch in IDNA.
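The example can be reproduced with the "punycode" and "idna" codecs that ship with Python's standard library. This is only an illustration of the mapping described above; the "idna" codec follows the original IDNA specification, so other IDNA implementations may treat some names differently.

```python
# Encode a single label with the built-in "punycode" codec (no prefix is added).
label = "bücher".encode("punycode")
print(label)   # b'bcher-kva'

# The "idna" codec works label by label and adds the "xn--" prefix where needed.
domain = "bücher.ch".encode("idna")
print(domain)  # b'xn--bcher-kva.ch'

# Decoding reverses the transformation.
print(domain.decode("idna"))  # bücher.ch
```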

1 Encoding of non-ASCII character insertions as code numbers
2 Re-encoding of code numbers as ASCII sequences
3 Spoofing concerns
4 External links

Encoding of non-ASCII character insertions as code numbers

Special characters are removed from the string, and a sequence of codes is appended at the end, one code for each insertion of a special character. The insertions are made primarily in order of Unicode value and secondarily in the order in which the characters occur in the string. Each code is a number: the count of possible insertions that come before the actual one at the given stage (that is, without regard to characters that will be inserted afterwards), where the possible insertions are likewise ordered primarily by Unicode value and secondarily by position. The first possibility, denoted by the code "a", means that character 128 is inserted at the beginning of the string or, if a special character has already been inserted, that the same character is inserted again immediately after the previous one.

The described coding is a form of delta encoding. Special characters in a word are usually from the same language, hence often with nearby Unicode values. Thus the numbers to be used are often smaller with this method. In the case of multiple occurrences of a character it also helps that positions are counted from the previous position.

In the case of "bücher", the code "kva" is used for inserting "ü" (character 252) in "bcher". The possibilities that come before the actual insertion of "ü" after the "b" are the characters 128–251, each in six possible positions, plus "ü" in front of the "b". That gives 124 × 6 + 1 = 745 possibilities, so the code number is 745.
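The insertion rule can be made concrete with a minimal Python sketch of the decoder's state, following the procedure in RFC 3492. The function name apply_deltas is hypothetical, and the code numbers are passed as plain integers rather than as ASCII digits.

```python
def apply_deltas(basic, deltas, initial_n=128):
    """Insert special characters into `basic` according to a list of code numbers.

    `n` is the current code point and `i` the current position; each code
    number advances this (n, i) state and then one character is inserted.
    """
    out = list(basic)
    n, i = initial_n, 0
    for delta in deltas:
        i += delta
        slots = len(out) + 1      # possible insertion positions right now
        n += i // slots           # every full pass over the slots moves to the next code point
        i %= slots
        out.insert(i, chr(n))
        i += 1                    # a following code of 0 inserts right after this character
    return "".join(out)


# 124 skipped characters times 6 positions, plus 1 position before the target.
delta = (ord("ü") - 128) * (len("bcher") + 1) + 1
print(delta)                           # 745
print(apply_deltas("bcher", [delta]))  # bücher
```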

Re-encoding of code numbers as ASCII sequences

Punycode uses generalized variable length integers to represent these values. For example, this is how "kva" is used to represent the code number 745:

A number system with little-endian ordering is used, which allows variable-length codes without separate delimiters: a digit lower than a threshold value marks that it is the most significant digit, hence the end of the number. The threshold value depends on the position in the number and also on previous insertions, to increase efficiency. Correspondingly, the weights of the digits (like the third digit from the right in ordinary numbers having a weight of 100) vary.

In this case a "number system" with 36 "digits" is used, with the case-insensitive 'a' through 'z' equal to the numbers 0 through 25, and '0' through '9' equal to 26 through 35. Thus "kva" corresponds to "10 21 0". The second digit has a weight of 35 instead of 36 because in a multi-digit number the first (least significant) digit is in the range b-9; an "a" in that position would mark the end of the number. The final "a" (0) contributes nothing to the value and only marks the end. Therefore "kva" represents the number 10 + 35 × 21 = 745.
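The digit arithmetic can be sketched in Python as follows. The constants (base 36, minimum threshold 1, maximum threshold 26, initial bias 72) are the Punycode parameters from RFC 3492; the function names are hypothetical, and the bias is held fixed here, so the sketch covers only the first code number of a string.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"
BASE, TMIN, TMAX = 36, 1, 26


def threshold(k, bias):
    """Threshold for the digit at position k (k = 36, 72, 108, ...)."""
    return min(max(k - bias, TMIN), TMAX)


def decode_number(text, bias=72):
    """Decode one little-endian variable-length number from the start of text."""
    value, weight, k = 0, 1, BASE
    for used, ch in enumerate(text.lower(), start=1):
        digit = ALPHABET.index(ch)
        value += digit * weight
        t = threshold(k, bias)
        if digit < t:              # below the threshold: most significant digit
            return value, used
        weight *= BASE - t         # this is why the second digit weighs 35, not 36
        k += BASE
    raise ValueError("number is not terminated")


def encode_number(value, bias=72):
    """Inverse of decode_number for a single value."""
    digits, k = [], BASE
    while True:
        t = threshold(k, bias)
        if value < t:
            digits.append(ALPHABET[value])
            return "".join(digits)
        digits.append(ALPHABET[t + (value - t) % (BASE - t)])
        value = (value - t) // (BASE - t)
        k += BASE


print(decode_number("kva"))  # (745, 3)
print(encode_number(745))    # kva
```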

For the insertion of a second special character in "bücher", the first possibility is "büücher" with code "bcher-kvaa", the second "bücüher" with code "bcher-kvab", etc. After "bücherü" with code "bcher-kvae" comes "ýbücher" with code "bcher-kvaf", etc.
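These second insertions need only a single digit because the thresholds are re-tuned after every insertion. The sketch below transcribes the bias adaptation step from RFC 3492 (skew 38, damp 700; the function name adapt follows the RFC's pseudocode) and then uses Python's built-in "punycode" codec to decode the codes listed above.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700


def adapt(delta, num_points, first_time):
    """Return the new bias after a code number `delta` has been processed.

    `num_points` is the length of the string after the insertion.
    """
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // num_points
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


# After inserting "ü" into "bcher" (code number 745, new length 6):
new_bias = adapt(745, 6, True)
print(new_bias)  # 0, so any digit from a to z now ends a number

for code in ("bcher-kvaa", "bcher-kvab", "bcher-kvae", "bcher-kvaf"):
    print(code, "->", code.encode("ascii").decode("punycode"))
```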

To keep the encoding and decoding algorithms simple, no attempt has been made to prevent some encoded values from encoding inadmissible Unicode values; these should, however, be checked for and detected during decoding.
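A decoder might apply a check along the following lines before inserting a character. This is a sketch under an assumption: "inadmissible" is taken here to mean a value outside the Unicode range or a surrogate code point, and the function name is hypothetical. IDNA imposes further restrictions that are not shown.

```python
def is_admissible(code_point):
    """Return True if a decoded value may be inserted as a character."""
    if not 0x80 <= code_point <= 0x10FFFF:
        return False   # insertions start at 128 and must stay within Unicode
    if 0xD800 <= code_point <= 0xDFFF:
        return False   # surrogate code points are not characters
    return True


print(is_admissible(ord("ü")))  # True
print(is_admissible(0x110000))  # False
```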

Compare the ASCII "punycoded" URL http://xn--tdali-d8a8w.lv/ with its full Unicode counterpart, which includes Latvian characters with the appropriate diacritics: http://tūdaliņ.lv.

Punycode is designed to work across all script systems, and to be self-optimizing by attempting to adapt to the character set ranges within the string as it operates. It is optimized for the case where the string is composed of zero or more ASCII characters and in addition characters from only one other script system, but will cope with any arbitrary Unicode string. Note that for DNS use, the domain name string is assumed to have been normalized using Nameprep and (for top-level domains) filtered against an officially registered language table before being Punycoded, and that the DNS protocol sets limits on the acceptable lengths of the output Punycode string.
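Both effects can be observed with Python's built-in "idna" codec, which applies Nameprep before Punycode and enforces the DNS limit of 63 octets per label. This is a small illustration and not a full IDNA implementation; language-table filtering is not covered.

```python
# Nameprep folds case before Punycode is applied, so both spellings
# yield the same ASCII form.
upper = "BÜCHER.ch".encode("idna")
lower = "bücher.ch".encode("idna")
print(upper == lower)  # True

# An over-long label is rejected instead of being encoded.
too_long = "ü" * 70 + ".ch"
try:
    too_long.encode("idna")
    rejected = False
except UnicodeError:
    rejected = True
print(rejected)  # True
```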

Spoofing concerns

Because IDNA, by way of Punycode, allows websites to use full Unicode names, it could leave their users open to phishing attacks: characters from different scripts can look identical (homographs). IDNA makes it possible to create a spoofed web site that looks exactly like another, including domain name and security certificate, but is in fact controlled by someone attempting to steal private information. See Internationalizing Domain Names in Applications for more.
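A short Python illustration of the problem, using the reserved name example.com and a look-alike in which one Latin letter is replaced by its Cyrillic counterpart. The two names can render identically on screen, yet they are different strings with different ASCII forms.

```python
genuine = "example.com"
spoof = "ex\u0430mple.com"   # U+0430 CYRILLIC SMALL LETTER A replaces the Latin "a"

print(genuine == spoof)         # False
print(genuine.encode("idna"))   # b'example.com'

ace = spoof.encode("idna")      # the ASCII form exposes the difference
print(ace.startswith(b"xn--"))  # True
```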

Rather than preventing users from accessing internationalized websites, the Firefox and Opera web browsers use a whitelist of top-level domains whose registries guard against possible spoofing exploits at the time of domain name registration. Hence, a name under a whitelisted TLD is displayed in Unicode, whereas names in untrusted domains are displayed in Punycode with the xn-- prefix. Characters from Latin-1 are allowed for all TLDs, even those not on the whitelist, as within Latin-1 there is little chance of an exploit using misleading characters.

Safari, as of Security Update 2005-003, does the same for a configurable list of scripts including the three most likely to mislead: Greek, Cyrillic, and Cherokee.
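The display rule described for Firefox and Opera can be sketched in a few lines of Python. The function name display_name and the contents of the whitelist are hypothetical and do not reflect any browser's actual code or list; the sketch only shows the shape of the decision.

```python
def display_name(ace_domain, trusted_tlds):
    """Return the form of an ASCII-encoded domain name to show the user."""
    unicode_name = ace_domain.encode("ascii").decode("idna")
    tld = ace_domain.rsplit(".", 1)[-1].lower()
    latin1_only = all(ord(ch) <= 0xFF for ch in unicode_name)
    if tld in trusted_tlds or latin1_only:
        return unicode_name   # trusted registry, or nothing outside Latin-1
    return ace_domain         # otherwise keep the xn-- form visible


trusted = {"de"}  # hypothetical whitelist
print(display_name("xn--bcher-kva.ch", trusted))  # bücher.ch (Latin-1 only)

cyrillic = "ex\u0430mple.com".encode("idna").decode("ascii")
print(display_name(cyrillic, trusted) == cyrillic)  # True: shown in Punycode
```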

Policies which satisfy Mozilla and Opera's criteria for preventing spoofing have included:

allowing characters in names to be chosen only from a small set with no homographic pairs
employing a "confusables" character list to search for visually equivalent names before allowing registration
employing a more elaborate, but effective, policy such as the JET Guidelines (RFC 3743) for Chinese, Japanese and Korean names
