Extended Unix Code
From Free net encyclopedia
Extended Unix Code (EUC) is a multibyte 8-bit character encoding system used primarily for Japanese, Korean, and simplified Chinese.
The structure of EUC is based on the ISO 2022 standard, which specifies a way to represent character sets containing up to 8836 (94×94) or 830584 (94×94×94) characters as sequences of 7-bit codes. Only ISO 2022–compliant character sets can have EUC forms. Moreover, because escape sequences are not used, each EUC encoding can only represent one particular corresponding ISO 2022 character set.
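The figures above follow from the fact that each byte of a 94-character ISO 2022 set is drawn from the 94 graphic positions 0x21 through 0x7E. A quick check of the arithmetic, as a sketch in Python (the variable names are illustrative only):

```python
# The 94 printable 7-bit positions available to each byte of a 94^n set.
GRAPHIC_BYTES = range(0x21, 0x7F)  # 0x21..0x7E inclusive

positions = len(GRAPHIC_BYTES)
two_byte_capacity = positions ** 2    # 94 x 94
three_byte_capacity = positions ** 3  # 94 x 94 x 94

print(positions, two_byte_capacity, three_byte_capacity)  # 94 8836 830584
```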
In actual use, EUC is always implicitly paired up with an ISO 646–compliant, 7-bit, single-byte code (often ASCII) with the most significant bit cleared. To get the EUC form of an ISO 2022 character, the most significant bit of each 7-bit byte of the original ISO 2022 codes is set (by adding 128 to each of these original 7-bit codes); this allows software to easily distinguish whether a particular byte in a character string belongs to the ISO 646 code or the ISO 2022 (EUC) code.
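The bit manipulation described above can be sketched in a few lines of Python. The function names `iso2022_to_euc` and `split_runs` are hypothetical helpers invented for this illustration; they are not part of any standard library.

```python
from itertools import groupby


def iso2022_to_euc(code: bytes) -> bytes:
    """Set the most significant bit of each 7-bit byte (same as adding 128)."""
    if any(b > 0x7F for b in code):
        raise ValueError("expected 7-bit ISO 2022 bytes")
    return bytes(b | 0x80 for b in code)


def split_runs(data: bytes) -> list[tuple[bool, bytes]]:
    """Split a mixed string into runs of (high_bit_set, bytes).

    Runs with the high bit clear belong to the ISO 646 code (e.g. ASCII);
    runs with the high bit set belong to the EUC code.
    """
    return [
        (is_high, bytes(group))
        for is_high, group in groupby(data, key=lambda b: b >= 0x80)
    ]


# The 7-bit pair 0x30 0x21 becomes 0xB0 0xA1 in EUC form.
euc = iso2022_to_euc(b"\x30\x21")
print(euc.hex())  # b0a1

# A mixed string separates cleanly on the most significant bit alone.
print(split_runs(b"A" + euc + b"B"))
```

This only classifies bytes by their high bit; it does not check that the high-bit runs form valid characters in any particular character set.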
The most commonly used EUC codes are two-byte codes, each based on an ISO 2022–compliant character set of up to 8836 characters. The EUC-CN form of GB2312, plus EUC-JP and EUC-KR, are examples of such two-byte EUC codes.
EUC-CN
EUC-CN is the usual way to use the GB2312 standard for simplified Chinese characters. Unlike the case of Japanese, the ISO-2022 form of GB2312 is not normally used, though a variant form called HZ was sometimes used on USENET.
EUC-CN is also the usual way to use the GB18030 standard, which includes traditional characters. Because GB18030 is backward-compatible with GB2312, and because modern Unicode-based operating systems allow users to (sometimes accidentally) enter characters outside GB2312, it is now more accurate to describe EUC-CN as a form of GB18030. However, GB18030 in EUC-CN form is a variable-length code, because GB18030 contains more than 8836 (94×94) characters.
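The backward compatibility and the variable length can both be observed with Python's built-in `gb2312` and `gb18030` codecs. This is only a small sketch; the two sample characters are arbitrary choices made for the illustration.

```python
basic = "中"             # a character that is in GB2312
beyond = "\U00020000"    # a CJK Extension B character, outside GB2312

# A GB2312 character keeps the same two bytes under GB18030.
same_bytes = basic.encode("gb2312") == basic.encode("gb18030")
print(same_bytes)  # True

# Characters outside the two-byte range need a longer sequence.
lengths = (len(basic.encode("gb18030")), len(beyond.encode("gb18030")))
print(lengths)  # (2, 4)
```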
Related encoding systems
An encoding related to EUC-CN is the "748" code used in the WITS typesetting system developed by Beijing's Founder Technology (now superseded by its newer FITS typesetting system). The 748 code contains all of GB2312, but it is not ISO 2022–compliant and is therefore not a true EUC code. (It uses an 8-bit lead byte but distinguishes between a second byte with its most significant bit set and one with its most significant bit cleared, and is therefore more similar in structure to Big5 and other non–ISO 2022–compliant DBCS encoding systems.) The non-GB2312 portion of the 748 code contains traditional and Hong Kong characters and other glyphs used in newspaper typesetting.
EUC-JP
EUC-JP is a way to use the Japanese JIS X0208, JIS X0213, and other related standards (usually called just the "JIS character set"). (ISO-2022-JP is another way to use the same set of standards.)
In the EUC-JP coding for Japanese, each byte of the code for a character in a JIS standard (i.e., each of the 7-bit values used in ISO-2022-JP) is simply incremented by 128. This allows the easy mixing of 7-bit ASCII and 8-bit Japanese without the need for the escape sequences employed by ISO-2022-JP.
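The relationship between the two encodings can be checked with Python's built-in `iso2022_jp` and `euc_jp` codecs. The sketch below assumes that, for this particular input, the ISO-2022-JP codec wraps the JIS X 0208 bytes in the escape sequences ESC $ B and ESC ( B; the sample text is an arbitrary choice.

```python
text = "日本語"

iso = text.encode("iso2022_jp")
# Assumption: the payload is bracketed by ESC $ B ... ESC ( B for this input.
assert iso.startswith(b"\x1b$B") and iso.endswith(b"\x1b(B")
jis_7bit = iso[3:-3]  # drop the two three-byte escape sequences

# Add 128 to every 7-bit JIS byte to obtain the EUC-JP form.
derived = bytes(b + 128 for b in jis_7bit)

print(jis_7bit.hex())                     # 467c4b5c386c
print(derived.hex())                      # c6fccbdcb8ec
print(derived == text.encode("euc_jp"))   # True

# ASCII mixes in directly, with no escape sequences.
mixed = ("abc" + text).encode("euc_jp")
print(mixed == b"abc" + derived)          # True
```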
The 7-bit ISO 646–compliant character set paired with EUC-JP is sometimes, but not always, ASCII; the lower 7 bits of JIS X0201 (which is almost identical to ASCII except that a yen sign takes the place of the backslash) are also often paired with EUC-JP.
In Japan, the encoding is heavily used on Unix and Unix-like operating systems but seldom elsewhere. It is consequently the least used of the three major Japanese encodings, behind both ISO-2022-JP (JIS) and Shift-JIS.
EUC-KR
EUC-KR is a similar 8-bit encoding of the KS X 1001 character set (the set also used by ISO-2022-KR), built on the same principle of simply adding 128 to each byte.
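Because each EUC-KR byte is a 7-bit code in the range 0x21 to 0x7E plus 128, the row and cell of a character in the 94×94 grid can be recovered by undoing that addition. The helper names below are hypothetical and exist only for this sketch, which relies on Python's built-in `euc_kr` codec.

```python
def euc_kr_row_cell(ch: str) -> tuple[int, int]:
    """Return the 1-based (row, cell) of a two-byte EUC-KR character."""
    raw = ch.encode("euc_kr")
    if len(raw) != 2:
        raise ValueError("not a two-byte EUC-KR character")
    lead, trail = raw
    # Clear the high bit (subtract 128), then map 0x21..0x7E onto 1..94.
    return (lead & 0x7F) - 0x20, (trail & 0x7F) - 0x20


def row_cell_to_euc_kr(row: int, cell: int) -> bytes:
    """Inverse mapping: 1-based (row, cell) to the two EUC-KR bytes."""
    return bytes([row + 0x20 + 0x80, cell + 0x20 + 0x80])


print(euc_kr_row_cell("가"))                       # (16, 1)
print(row_cell_to_euc_kr(16, 1).decode("euc_kr"))  # 가
```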
EUC-TW
EUC-TW is a rarely used encoding for traditional Chinese characters as used in Taiwan. Big5 is much more common.
Unlike normal EUC codes (the EUC-CN form of GB2312, plus EUC-JP and EUC-KR), EUC-TW uses up to four bytes per character; this difference arises from the fact that CNS11643 contains more than 8836 (i.e., 94×94) characters, spread across several 94×94 planes. However, instead of using a three-byte code as one might expect, a four-byte encoded character in EUC-TW has the following peculiar structure:
- Lead byte: 8E
- Second byte: A0 + plane number
- Third and fourth bytes: Two-byte EUC form of the code point in the specified plane
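The four-byte layout listed above can be sketched as a pair of small Python functions. Python has no built-in EUC-TW codec, so the sketch works on raw byte values only; the function names are hypothetical, and the plane range of 1 to 16 is an assumption made for the bounds check rather than something stated in this article.

```python
def euc_tw_four_byte(plane: int, hi: int, lo: int) -> bytes:
    """Build the four-byte EUC-TW form from a plane number and a
    7-bit two-byte code point (each byte in 0x21..0x7E)."""
    if not 1 <= plane <= 16:  # assumed range, see note above
        raise ValueError("plane out of range")
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("code point bytes must be in 0x21..0x7E")
    # Lead byte 8E, then A0 + plane, then the two-byte EUC form.
    return bytes([0x8E, 0xA0 + plane, hi | 0x80, lo | 0x80])


def parse_euc_tw_four_byte(seq: bytes) -> tuple[int, int, int]:
    """Split a four-byte sequence back into (plane, hi, lo)."""
    if len(seq) != 4 or seq[0] != 0x8E:
        raise ValueError("not a four-byte EUC-TW sequence")
    return seq[1] - 0xA0, seq[2] & 0x7F, seq[3] & 0x7F


encoded = euc_tw_four_byte(2, 0x21, 0x21)
print(encoded.hex())                    # 8ea2a1a1
print(parse_euc_tw_four_byte(encoded))  # (2, 33, 33)
```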
External links
- EUC-JP: a table of the non-ASCII part of the codeset.
- http://developers.sun.com/dev/gadc/technicalpublications/articles/gb18030.html — Information about GB18030
- http://www.jagat.or.jp/asia/report/China3.htm mentions the 748 code
- http://www.cns11643.gov.tw/web/word.jsp describes the EUC-TW code (in Chinese)