Code page


Code page is the traditional IBM term for a specific character encoding table: a mapping in which a sequence of bits, usually a single octet representing an integer value from 0 through 255, is associated with a specific character. IBM and Microsoft often allocate a code page number to a character set even if that character set is better known by another name.

Whilst the term code page originated with IBM's EBCDIC mainframe systems, it is most commonly associated with the IBM PC code pages (which Microsoft calls the OEM code pages) and with the Windows ANSI code pages. Most well-known code pages (excluding those for the CJK languages and Vietnamese) represent character sets that fit in 8 bits and involve nothing that cannot be handled by mapping each code to a simple bitmap (that is, no combining characters, complex scripts, etc.).

The text mode of standard (VGA-compatible) PC graphics hardware is built around an 8-bit code page (though it is possible to use two at once with some sacrifice of color depth, and up to 8 may be stored in the display adaptor for easy switching [1]). A selection of code pages could be loaded into such hardware. However, it is now commonplace for operating system vendors to provide their own character encoding and rendering systems that run in a graphics mode and bypass this system entirely. The character encodings used by these graphical systems (particularly Windows) are sometimes called code pages as well.
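The idea of "mapping each code to a simple bitmap" can be illustrated with a toy font table. This is only a sketch: the 8x8 glyph below is a hypothetical bitmap drawn for this example, not data taken from any real VGA font, and `render` is an illustrative helper name.

```python
# A toy "font": 256 glyph slots, each glyph being 8 rows of 8 pixels.
# Each row is one byte; the most significant bit is the leftmost pixel.
GLYPH_ROWS = 8

# Hypothetical bitmap for code 0x41, invented for this illustration.
GLYPH_A = [0x18, 0x3C, 0x66, 0x66, 0x7E, 0x66, 0x66, 0x00]

font = [[0] * GLYPH_ROWS for _ in range(256)]
font[0x41] = GLYPH_A


def render(code, font_table):
    """Return the glyph for one 8-bit code as rows of '#' and '.' characters."""
    rows = []
    for row_bits in font_table[code]:
        rows.append("".join("#" if row_bits & (0x80 >> col) else "."
                            for col in range(8)))
    return rows


if __name__ == "__main__":
    print("\n".join(render(0x41, font)))
```

Loading a different code page in this model amounts to replacing the contents of `font`; the byte values stored in the text buffer stay the same.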

1 Relationship to ASCII
2 IBM PC (OEM) code pages
3 Other code pages of note
4 Windows (ANSI) code pages
5 Private code pages
6 See also
7 External links


Relationship to ASCII

The basis of the IBM PC code pages is ASCII, a 7-bit code representing 128 characters and control codes. In the past, systems that stored ASCII in 8-bit bytes often either set the top bit to zero or used it as a parity bit in network data transmissions. When this bit was instead made available for representing character data, another 128 characters and control codes could be represented. IBM used this extended range to encode characters used by various languages. No formal standard existed for these ‘extended character sets’; IBM merely referred to the variants as code pages, as it had always done for variants of EBCDIC encodings.
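The split between a shared 7-bit range and a page-specific upper range can be checked with Python's built-in codecs. This is a small sketch, assuming the standard codec names `cp437` and `cp850`; the sample bytes were picked only for illustration.

```python
# Bytes 0x00-0x7F decode identically under ASCII and the IBM PC code pages.
ascii_bytes = b"Code page"
same_low = (
    ascii_bytes.decode("ascii")
    == ascii_bytes.decode("cp437")
    == ascii_bytes.decode("cp850")
)

# A byte with the top bit set depends on which code page is in effect.
high_byte = bytes([0x9B])
in_437 = high_byte.decode("cp437")  # cent sign in code page 437
in_850 = high_byte.decode("cp850")  # o with stroke in code page 850

# A channel that clears the top bit turns 0x9B into 0x1B (ESC),
# so the extended character cannot survive a 7-bit path.
stripped = bytes(b & 0x7F for b in high_byte)

if __name__ == "__main__":
    print(same_low, in_437, in_850, stripped)
```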


IBM PC (OEM) code pages

These code pages are most often used under MS-DOS-like operating systems; they include many box-drawing characters. Since the original IBM PC code page (number 437) was not really designed for international use, several incompatible variants emerged. Microsoft refers to these as the OEM code pages. Examples include:

437 — The original IBM PC code page
737 — Greek
850 — "Multilingual (Latin-1)" (Western European languages)
852 — "Slavic (Latin-2)" (Eastern European languages)
855 — Cyrillic
857 — Turkish
858 — "Multilingual" with euro symbol
860 — Portuguese
861 — Icelandic
863 — French Canadian
865 — Nordic
866 — Cyrillic
869 — Greek
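The incompatibility between these variants shows up as soon as the same byte is decoded under different pages. The sketch below assumes Python's standard codec names for a few of the pages listed above; `decode_under_each` is an illustrative helper, not part of any library.

```python
# A subset of the OEM code pages above, by their Python codec names.
OEM_CODECS = ["cp437", "cp737", "cp850", "cp852", "cp855", "cp857", "cp866"]


def decode_under_each(data):
    """Decode the same bytes under each code page; unassigned bytes become U+FFFD."""
    return {name: data.decode(name, errors="replace") for name in OEM_CODECS}


# The first upper-half position (0x80) holds a different letter per page.
first_upper = decode_under_each(b"\x80")

# Box-drawing characters in code page 437: the top edge of a double-line box.
box_top = bytes([0xC9, 0xCD, 0xBB]).decode("cp437")

if __name__ == "__main__":
    for name, char in first_upper.items():
        print(name, char)
    print(box_top)
```

Under these codecs, 0x80 is a Latin letter in code page 437, a Greek capital in 737 and a Cyrillic capital in 866, which is why text written under one page is misread under another.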


Other code pages of note

10000 — Macintosh Roman encoding (followed by several other Mac character sets)
10007 — Macintosh Cyrillic encoding
10029 — Macintosh Central European encoding
932 — Supports Japanese
936 — GBK Supports Simplified Chinese
949 — Supports Korean
950 — Supports Traditional Chinese
1200 — UCS-2LE Unicode little-endian
1201 — UCS-2BE Unicode big-endian
65001 — UTF-8 Unicode

In modern applications, operating systems and programming languages, the IBM code pages have been rendered obsolete by international standards, such as ISO 8859-1 and Unicode.
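For experimenting with the code pages listed in this section, the numbers can be paired with Python codec names. The table below is an assumed correspondence made for this sketch, not an official registry; in particular, Python's `utf-16-le` and `utf-16-be` codecs implement UTF-16, of which UCS-2 is the subset without surrogate pairs.

```python
# Assumed pairing of code page numbers with Python codec names.
CODEC_FOR_PAGE = {
    932: "cp932",          # Japanese
    936: "gbk",            # Simplified Chinese
    949: "cp949",          # Korean
    950: "cp950",          # Traditional Chinese
    1200: "utf-16-le",     # little-endian 16-bit Unicode
    1201: "utf-16-be",     # big-endian 16-bit Unicode
    65001: "utf-8",
    10000: "mac_roman",
    10007: "mac_cyrillic",
    10029: "mac_latin2",   # Macintosh Central European
}


def encode_for_page(text, page):
    """Encode text using the codec assumed to match the given code page number."""
    return text.encode(CODEC_FOR_PAGE[page])


if __name__ == "__main__":
    print(encode_for_page("A", 1200))         # low byte first
    print(encode_for_page("A", 1201))         # high byte first
    print(len(encode_for_page("日本", 932)))  # two bytes per character
```

The CJK pages are multi-byte: ASCII characters still occupy one byte, while each ideograph takes two.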


Windows (ANSI) code pages

Microsoft defined a number of code pages known as the ANSI code pages (the first one, 1252, was based on an ANSI draft of what became ISO 8859-1). Code page 1252 is built on ISO 8859-1 but uses the range 0x80-0x9F for extra printable characters rather than the C1 control codes used in ISO 8859-1. Some of the others are based in part on other parts of ISO 8859, but often rearranged to bring them closer to 1252.

1250 — East European Latin
1251 — Cyrillic
1252 — West European Latin
1253 — Greek
1254 — Turkish
1255 — Hebrew
1256 — Arabic
1257 — Baltic
1258 — Vietnamese
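The difference between code page 1252 and ISO 8859-1 can be inspected directly. This is a minimal sketch using Python's `cp1252` and `latin-1` codecs; note that Python's `cp1252` table leaves a few positions in 0x80-0x9F unassigned, which `errors="replace"` turns into U+FFFD here.

```python
# "Hi" wrapped in bytes 0x93 and 0x94.
data = bytes([0x93, 0x48, 0x69, 0x94])

as_cp1252 = data.decode("cp1252")   # curly double quotes around "Hi"
as_latin1 = data.decode("latin-1")  # C1 control characters around "Hi"

# Collect every byte value that the two encodings interpret differently.
differing = [
    b for b in range(256)
    if bytes([b]).decode("cp1252", errors="replace") != bytes([b]).decode("latin-1")
]

if __name__ == "__main__":
    print(repr(as_cp1252), repr(as_latin1))
    print(hex(differing[0]), hex(differing[-1]), len(differing))
```

Every differing byte falls in 0x80-0x9F; outside that range the two encodings agree.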

Many Microsoft products produce characters in the 0x80-0x9F range automatically, notably with ‘smart quotes’. This means that other software has to choose between:

not interoperating with documents produced with Microsoft applications
mis-rendering the text in question
adding support for the Microsoft code pages, in effect making Microsoft's implementation a de facto standard.

Worse, Microsoft applications mislabeled text in Windows-1252 as ISO-8859-1, and many Windows developers followed their example. Whilst current Microsoft applications seem to label Windows-1252 text correctly when they can (such as when sending e-mail), they still allow these characters to be both read and written (e.g. through forms) on websites declared as ISO-8859-1. The most popular competing web browsers do so too, presumably because they value compatibility over standards compliance.
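The compatibility-first behaviour described above can be sketched as a decoder that treats an ISO-8859-1 label as Windows-1252. The function name `decode_declared` and the label set are assumptions made for this illustration; real browsers implement their own, more detailed label tables.

```python
# Labels that this sketch decodes as Windows-1252 regardless of what was declared.
TREAT_AS_CP1252 = {"iso-8859-1", "latin-1", "latin1", "windows-1252", "cp1252"}


def decode_declared(data, declared):
    """Decode bytes according to a declared charset label, leniently."""
    label = declared.strip().lower()
    if label in TREAT_AS_CP1252:
        try:
            return data.decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 table leaves a few bytes unassigned; for input
            # containing them, fall back to a strict ISO 8859-1 reading of
            # the whole input (a simplification chosen for this sketch).
            return data.decode("latin-1")
    return data.decode(label)


if __name__ == "__main__":
    print(decode_declared(b"\x93ok\x94", "ISO-8859-1"))
```

With this policy, smart quotes in a page declared as ISO-8859-1 display as quotes rather than as invisible control characters, at the cost of ignoring the declared label.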

These code pages were sometimes viewed as part of Microsoft's embrace, extend and extinguish strategy towards open standards (though it's difficult to see how something as simple as an 8-bit character table could ever really be kept proprietary). On the other hand, since standards bodies had decided not to assign graphical characters to the upper-half control-character positions 80–9F, which are hardly used in practice for control functions, 32 of the 256 available code positions (12.5%) appeared to be wasted. This, perhaps, was not in users' best interests either.


Private code pages

Early in the history of personal computers, when users did not find their character encoding requirements met, private or local code pages were created using Terminate and Stay Resident utilities or by re-programming BIOS EPROMs. In some cases, unofficial code page numbers were invented (e.g., cp895).

When more diverse character set support became available, most of those code pages fell into disuse, with some exceptions such as the Kamenický or KEYBCS2 encoding for the Czech and Slovak alphabets.
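A private code page of this kind is simply an existing table with some positions reassigned. The sketch below builds one on top of code page 437 using Python's `codecs` charmap helpers. The two overrides are hypothetical assignments invented for this example and are not the Kamenický layout; `decode_private` and `encode_private` are illustrative names.

```python
import codecs

# Start from the full 256-entry table of code page 437.
base = bytes(range(256)).decode("cp437")

# Hypothetical reassignments of two upper-half positions (illustration only).
overrides = {0xE0: "\u017d", 0xE1: "\u017e"}  # Z with caron, z with caron

# The decoding table is a 256-character string indexed by byte value.
table = "".join(overrides.get(i, ch) for i, ch in enumerate(base))
encoding_map = codecs.charmap_build(table)


def decode_private(data):
    """Decode bytes using the private table."""
    return codecs.charmap_decode(data, "strict", table)[0]


def encode_private(text):
    """Encode text using the private table; unmapped characters raise an error."""
    return codecs.charmap_encode(text, "strict", encoding_map)[0]


if __name__ == "__main__":
    print(decode_private(b"\xe0abc"))
    print(encode_private("\u017e"))
```

The characters that previously occupied the overridden positions can no longer be encoded, which is the usual cost of a private variant: text exchanged with a system using the unmodified page is silently misread.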
