Character encoding

A character encoding consists of a code that pairs a sequence of characters (units of information corresponding to graphemes or grapheme-like units, such as might appear in an alphabet or syllabary used to write a natural language) from a given set with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the storage of text in computers and its transmission through telecommunication networks. Common examples include Morse code, which encodes letters of the Latin alphabet as series of long and short depressions of a telegraph key, and ASCII, which encodes letters, numerals and other symbols both as integers and as 7-bit binary versions of those integers, generally extended with an extra zero bit to facilitate storage in 8-bit bytes (octets).
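
To make the ASCII case concrete, the following minimal Python sketch (the characters chosen are arbitrary, for illustration only) prints a few characters together with their ASCII integers, their 7-bit binary forms, and the zero-padded 8-bit forms in which they are typically stored:

```python
# Each ASCII character is assigned an integer, shown here in decimal,
# as a 7-bit binary number, and zero-extended to a full 8-bit octet.
for ch in "Hi!":
    code = ord(ch)
    print(ch, code, format(code, "07b"), format(code, "08b"))

# H 72  1001000 01001000
# i 105 1101001 01101001
# ! 33  0100001 00100001
```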

In the early days of computing, the introduction of character sets such as ASCII (1963) and EBCDIC (1964) began the process of standardisation. The limitations of such sets soon became apparent, and a number of ad hoc methods were developed to extend them. The need to support multiple writing systems, including the CJK family of East Asian scripts, required support for a far larger number of characters and demanded a systematic approach to character encoding rather than the previous ad hoc extensions.

Simple character sets

Conventionally, the terms character set and character encoding were considered synonymous, since the same standard would specify both which characters were available and how they were to be encoded into a stream of code units (usually with a single character per code unit). For historical reasons, MIME and systems based on it use the term charset to refer to the complete system for encoding a sequence of characters into a sequence of octets.

Modern encoding model

Unicode and its parallel standard, the ISO 10646 Universal Character Set, which together constitute the most modern character encoding model, broke away from this idea. Instead, they separate the questions of which characters are available, how those characters are numbered, how the numbers are encoded as a series of "code units" (limited-size numbers), and finally how those units are encoded as a stream of octets (bytes). The idea behind this decomposition is to establish a universal set of characters that can be encoded in a variety of ways. Describing this model correctly requires more precise terms than "character set" and "character encoding".

A character repertoire is the full set of abstract characters that a system supports. The repertoire may be closed, that is, no additions are allowed without creating a new standard (as is the case with ASCII and most of the ISO-8859 series), or it may be open, allowing additions (as is the case with Unicode and, to a limited extent, the Windows code pages).

A coded character set specifies how to represent a repertoire of characters using a number of non-negative integer codes. Multiple coded character sets may share the same repertoire; for example, ISO-8859-1 and IBM code pages 037 and 500 all cover the same repertoire but map it to different codes, as illustrated by the sketch below.
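
For example, the letter 'A' belongs to the shared repertoire but is assigned the code 0x41 by ISO-8859-1 and 0xC1 by the EBCDIC-based code pages 037 and 500. A small sketch using Python's standard latin-1, cp037 and cp500 codecs shows the difference:

```python
# The same abstract character, encoded under three coded character sets
# that share a repertoire but assign different integer codes.
for codec in ("latin-1", "cp037", "cp500"):
    print(codec, "A".encode(codec).hex())

# latin-1 41
# cp037   c1
# cp500   c1
```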

A character encoding form (CEF) specifies the conversion of the integer codes into a series of fixed-size integer code values suited to storage in a system that uses fixed bit widths (e.g. virtually any computer system). The simplest approach is to choose a fixed size large enough that the values from the coded character set can be encoded directly. This works well for coded character sets that fit in 8 bits (as most legacy non-CJK encodings do) and reasonably well for coded character sets that fit in 16 bits (such as early versions of Unicode). However, as the size of the coded character set increases (modern Unicode, for example, requires at least 21 bits per character), this becomes less and less efficient, and it is difficult to adapt existing systems to use larger code values. Therefore, most systems working with Unicode today use either UTF-8, which maps Unicode code points to variable-length sequences of octets, or UTF-16, which maps Unicode code points to variable-length sequences of 16-bit words.
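
As a sketch of the difference between these two encoding forms, consider the code point U+1D11E (MUSICAL SYMBOL G CLEF), which lies above U+FFFF and therefore cannot fit in a single 8-bit or 16-bit code unit. Python's standard codecs show the resulting code-unit sequences:

```python
# One code point, two encoding forms: UTF-8 uses four 8-bit code units,
# UTF-16 uses two 16-bit code units (a surrogate pair).
ch = "\U0001D11E"                        # U+1D11E, MUSICAL SYMBOL G CLEF
print(ch.encode("utf-8").hex())          # f09d849e  -> code units F0 9D 84 9E
print(ch.encode("utf-16-be").hex())      # d834dd1e  -> code units D834 DD1E
```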

Finally, a character encoding scheme (CES) specifies how the fixed-size integer codes are mapped to an octet sequence suitable for saving on an octet-based file system or transmitting over an octet-based network. With Unicode, a simple character encoding scheme is used in most cases, specifying only whether the bytes of each integer should be in big-endian or little-endian order (even this is unnecessary with UTF-8). However, there are also compound character encoding schemes, which use escape sequences to switch between several simple schemes (such as ISO 2022), and compression schemes, which try to minimise the number of bytes used per code unit (such as SCSU, BOCU, and Punycode).
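
To illustrate the simple Unicode encoding schemes, the sketch below (again using Python's standard codecs) serialises the single code point U+00E9 in big-endian and little-endian UTF-16 and in UTF-8, whose octet order is fixed by the encoding form itself:

```python
# The same code point serialised under three character encoding schemes.
ch = "\u00e9"                           # é, U+00E9
print(ch.encode("utf-16-be").hex())     # 00e9  (big-endian octet order)
print(ch.encode("utf-16-le").hex())     # e900  (little-endian octet order)
print(ch.encode("utf-8").hex())         # c3a9  (no byte-order choice to make)
```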

History of character encodings

Popular character encodings
