Character encodings in HTML
HTML has been in use since 1991, but the first standardized version with a reasonably complete treatment of international characters was version 4.0, not published until 1997. Considerable care must be exercised when creating HTML pages with special characters outside the range of seven-bit ASCII to ensure two goals: the integrity of the information stored in the HTML document, and proper display of the document by the largest possible variety of browsers.
The document character encoding
When HTML documents are served to the viewer, there are three ways to tell the browser what specific character encoding is used. First, HTTP headers can be sent by the server along with each page. A typical header looks like this:
Content-Type: text/html; charset=ISO-8859-1
For HTML (not usually XHTML), a second method is for the HTML document to include this information at its top, inside the HEAD element:
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
XHTML documents have a third option: to express the character encoding in the XML preamble, for example
<?xml version="1.0" encoding="ISO-8859-1"?>
Each of these methods advises the receiver that the file being sent uses the character encoding specified. The character encoding is often referred to as the "character set", and it does indeed limit the characters that can appear in the raw source text. However, the HTML standard states that the "charset" is to be treated as an encoding of Unicode characters, and it provides a way to specify characters that the "charset" does not cover. The term "code page" is used in a similar sense.
It is a bad idea to send incorrect information about the character encoding in use. For example, a server where multiple users may place files created on different machines cannot promise that all the files it sends will conform, since some users may have machines with different character sets. For this reason, many servers simply do not send the information at all, to avoid making any false promises. This, however, may result in the equally bad situation of the user agent displaying the document wrongly because it does not know which character encoding to use.
The specification in the HTTP headers overrides a specification in a meta element in the document itself, which can be a problem if the headers are incorrect and one does not have the access or the knowledge to change them.
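The precedence described above can be sketched in a few lines of Python. This is only an illustration, not how any particular browser works: the function name declared_encoding and the three regular expressions are hypothetical and deliberately simplistic, and checking the XML declaration before the meta element is an assumption of this sketch (the text above only states that the HTTP header overrides the meta element).

```python
import re

# Hypothetical, simplified patterns; real user agents parse far more carefully.
CHARSET_PARAM = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)
XML_DECLARATION = re.compile(
    rb'^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
META_CONTENT_TYPE = re.compile(
    rb'<meta[^>]+http-equiv\s*=\s*["\']?Content-Type["\']?'
    rb'[^>]*content\s*=\s*["\']([^"\']*)["\']',
    re.IGNORECASE)


def declared_encoding(content_type_header, body):
    """Return the declared encoding name, or None if nothing declares one."""
    # 1. A charset parameter in the HTTP Content-Type header wins.
    if content_type_header:
        match = CHARSET_PARAM.search(content_type_header)
        if match:
            return match.group(1)
    head = body[:1024]  # only look near the top of the document
    # 2. XML declaration (assumed here to be checked before the meta element).
    match = XML_DECLARATION.search(head)
    if match:
        return match.group(1).decode("ascii")
    # 3. meta http-equiv element inside the document itself.
    match = META_CONTENT_TYPE.search(head)
    if match:
        inner = CHARSET_PARAM.search(match.group(1).decode("ascii", "replace"))
        if inner:
            return inner.group(1)
    return None  # the receiver would have to guess


meta_page = b'<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">'
print(declared_encoding("text/html; charset=ISO-8859-1", meta_page))  # ISO-8859-1
print(declared_encoding(None, meta_page))                             # US-ASCII
```

The first call shows the problem mentioned above: if the header is wrong, the meta element in the document cannot correct it.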
Browsers receiving a file with no character encoding information must make a blind assumption. For Western European languages, it is typical and fairly safe to assume windows-1252 (which is similar to ISO-8859-1 but has printable characters in place of some control codes that are forbidden in HTML anyway), but it is also common for browsers to assume the character set native to the machine on which they are running. The consequence of choosing incorrectly is that characters outside the printable ASCII range (32 to 126) usually appear incorrectly. This presents few problems for English-speaking users, but other languages require characters outside that range for everyday use. In CJK environments where there are several different multibyte encodings in use, autodetection is often employed.
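The effect of a wrong guess can be reproduced with Python's built-in codecs. In this small sketch the helper name decode_as and the sample word are illustrative choices, not part of any standard; it decodes the same bytes under different assumed encodings.

```python
def decode_as(data: bytes, encoding: str) -> str:
    """Decode bytes under an assumed encoding, marking undecodable bytes."""
    return data.decode(encoding, errors="replace")


utf8_page = "café".encode("utf-8")          # b'caf\xc3\xa9'
latin1_page = "café".encode("iso-8859-1")   # b'caf\xe9'

# UTF-8 bytes read as windows-1252: each byte becomes its own character.
print(decode_as(utf8_page, "windows-1252"))     # cafÃ©

# ISO-8859-1 bytes read as windows-1252: same result for printable characters.
print(decode_as(latin1_page, "windows-1252"))   # café

# Printable ASCII survives any of these guesses unchanged.
print(decode_as(b"plain ASCII", "windows-1252"))  # plain ASCII

# The two encodings differ in the 0x80-0x9F range: a control code in
# ISO-8859-1, a printable character in windows-1252.
print(repr(decode_as(b"\x80", "iso-8859-1")))   # '\x80'
print(decode_as(b"\x80", "windows-1252"))       # €
```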
It is increasingly common for multilingual websites to use one of the Unicode/ISO 10646 transformation formats, as this allows use of the same encoding for all languages. Generally UTF-8 is used rather than UTF-16 or UTF-32 because it is easier to handle in programming languages that assume a byte-oriented ASCII superset encoding, and it is efficient for ASCII-heavy text (which HTML tends to be).
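The size argument is easy to check. The following sketch (the function name encoded_sizes and the sample markup are illustrative) counts the bytes needed for a short, mostly ASCII fragment in each transformation format; the little-endian variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of the text in each Unicode transformation format."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}


sample = '<p lang="el">λ</p>'   # 17 ASCII characters plus one Greek letter
print(encoded_sizes(sample))    # {'utf-8': 19, 'utf-16-le': 36, 'utf-32-le': 72}

# UTF-8 is an ASCII superset: pure ASCII markup has identical bytes in both.
print("<p>".encode("utf-8") == "<p>".encode("ascii"))  # True
```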
Successful viewing of a page is not necessarily an indication that its encoding is specified correctly. If the creator of a page and the reader are both assuming some machine-specific character encoding, and the server does not send any identifying information, then the reader will nonetheless see the page as the creator intended, but other readers with different native sets will not.
Character references
In addition to native character encodings, characters can also be encoded as character references, which can be numeric character references (decimal or hexadecimal) or character entity references. Character entity references are also sometimes referred to as named entities, or HTML entities for HTML. Usage of character references derives from SGML.
Character entity references have the format &name; where "name" is a case-sensitive alphanumeric string. For example, the character 'λ' can be encoded as &lambda; in an HTML 4 document. The characters <, >, " and & are used to delimit tags, attribute values, and character references. The character entity references &lt;, &gt;, &quot; and &amp;, which are predefined in HTML, XML, and SGML, can be used instead for literal representations of these characters.
Numeric character references can be in decimal format, &#DDD;, where "DDD" is a variable-width string of decimal digits. Similarly, there is a hexadecimal format, &#xHHH;, where "HHH" is a variable-width string of hexadecimal digits. Unlike named entities, hexadecimal character references are case-insensitive in HTML. For example, λ can also be represented as &#955;, &#x3BB; or &#X3bb;.
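These equivalences can be checked with the unescape function from Python's standard html module, which resolves HTML character references. The two small helpers for building references, decimal_reference and hex_reference, are hypothetical names introduced only for this sketch.

```python
import html


def decimal_reference(ch: str) -> str:
    """Build a decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


def hex_reference(ch: str) -> str:
    """Build a hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):X};"


# The named, decimal, and hexadecimal forms all resolve to the same character.
for reference in ("&lambda;", "&#955;", "&#x3BB;", "&#X3bb;"):
    print(reference, "->", html.unescape(reference))

# Entity names are case-sensitive: &Lambda; is the capital letter.
print(html.unescape("&Lambda;"))   # Λ

print(decimal_reference("λ"))      # &#955;
print(hex_reference("λ"))          # &#x3BB;
```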
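Numeric references always refer to Unicode values, irrespective of page encoding. Using numeric references that lie within the reserved control areas of Unicode is therefore illegal. This covers the characters in the (hex) ranges 00–1F, 7F, and 80–9F, that is, &#0; to &#31; and &#127; to &#159;. HTML 4 and XML 1.0 do, however, permit the whitespace controls tab, line feed, and carriage return (&#9;, &#10;, and &#13;).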
Unnecessary use of HTML character references may significantly reduce the readability of HTML. If the character encoding for a web page is chosen appropriately then HTML character references are usually only required for a few special characters.
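How many references a page needs depends on the chosen encoding. As a rough illustration, Python's "xmlcharrefreplace" error handler substitutes a decimal numeric reference for every character the target encoding cannot represent; the helper name encode_with_references and the sample text are assumptions of this sketch.

```python
def encode_with_references(text: str, encoding: str) -> bytes:
    """Encode text, falling back to numeric character references
    for characters the chosen encoding cannot represent."""
    return text.encode(encoding, errors="xmlcharrefreplace")


sample = "café λ"

# US-ASCII covers neither accented nor Greek letters: two references needed.
print(encode_with_references(sample, "ascii"))       # b'caf&#233; &#955;'

# ISO-8859-1 covers é directly but still needs a reference for λ.
print(encode_with_references(sample, "iso-8859-1"))  # b'caf\xe9 &#955;'

# UTF-8 covers everything, so no references are needed at all.
print(encode_with_references(sample, "utf-8"))       # b'caf\xc3\xa9 \xce\xbb'
```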
XML character entity references
Unlike traditional HTML with its large range of character entity references, in XML there are only five predefined character entity references. These are used to escape characters that are markup sensitive in certain contexts:
- &amp; = & (ampersand, U+0026)
- &lt; = < (left angle bracket, less-than sign, U+003C)
- &gt; = > (right angle bracket, greater-than sign, U+003E)
- &quot; = " (quotation mark, U+0022)
- &apos; = ' (apostrophe, U+0027)
All other character entity references have to be defined before they can be used. For example, use of &eacute; (which gives é, Latin small letter E with acute, U+00E9, in HTML) in an XML document will generate an error unless the entity has already been defined. XML also requires that the x in hexadecimal numeric references be in lowercase: for example &#xA1b; rather than &#XA1b;. XHTML, which is an XML application, supports the HTML 4 entity set and XML's &apos; entity, which does not appear in HTML 4.
However, use of &apos; in XHTML should generally be avoided for compatibility reasons.
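These XML rules can be observed with any conforming XML parser. The sketch below uses xml.etree.ElementTree from Python's standard library; the helper name parse_text is hypothetical, and it simply reports None when the parser rejects the input.

```python
import xml.etree.ElementTree as ET


def parse_text(xml_source: str):
    """Return the root element's text, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_source).text
    except ET.ParseError:
        return None


# The five predefined references need no declaration.
print(parse_text("<p>&amp;&lt;&gt;&quot;&apos;</p>"))   # &<>"'

# An HTML-only entity is an error in plain XML...
print(parse_text("<p>&eacute;</p>"))                    # None

# ...unless the document defines it first, here in an internal DTD subset.
print(parse_text('<!DOCTYPE p [<!ENTITY eacute "&#233;">]><p>&eacute;</p>'))  # é

# The x of a hexadecimal reference must be lowercase in XML.
print(parse_text("<p>&#xA1b;</p>"))                     # ਛ
print(parse_text("<p>&#XA1b;</p>"))                     # None
```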
HTML character entity references
For a list of all HTML character entity references, see List of XML and HTML character entity references (approx. 250 entries).
For a list of all HTML decimal character references, see List of HTML decimal character references (approx. 10,000 entries).