Prefix code


A prefix code is a code that satisfies the "prefix property": no code word is a prefix of any other code word in the set. A code with the code words {0, 10, 11} meets the prefix property; a code whose set is {0, 1, 10, 11} does not, because "1" is a prefix of both "10" and "11".
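
For illustration, here is a small Python sketch (not part of any standard, just a direct translation of the definition) that tests whether a set of code words has the prefix property:

   def has_prefix_property(code_words):
       """Return True if no code word is a prefix of any other."""
       for a in code_words:
           for b in code_words:
               if a != b and b.startswith(a):
                   return False  # a is a prefix of b
       return True

   has_prefix_property({"0", "10", "11"})       # True
   has_prefix_property({"0", "1", "10", "11"})  # False: "1" prefixes "10" and "11"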

Prefix codes are also known as prefix-free codes, comma-free codes, or instantaneous codes; even though Huffman coding is only one algorithm for deriving prefix codes, prefix codes are also widely referred to as "Huffman codes" (even, confusingly, when the codes were not produced by a Huffman algorithm).

The prefix property permits code words to be transmitted and properly framed without the need for out-of-band markers (assuming that the receiver can correctly identify the start of the transmission and that there are no uncorrected errors in the symbol stream). This is not possible with codes that lack the prefix property, such as our example of {0,1,10,11}: a receiver that reads a "1" at the start of a code word cannot know whether that is the complete code word "1" or merely the prefix of the code word "10" or "11".
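
A sketch of how such framing can work in practice: the decoder below accumulates bits until they match a complete code word. The symbol assignment {0 -> a, 10 -> b, 11 -> c} is hypothetical:

   def decode(bits, codebook):
       """Frame and decode a bit string; a prefix code needs no separators."""
       symbols, buffer = [], ""
       for bit in bits:
           buffer += bit
           if buffer in codebook:   # a complete code word has been seen
               symbols.append(codebook[buffer])
               buffer = ""          # start framing the next code word
       return symbols

   codebook = {"0": "a", "10": "b", "11": "c"}
   decode("010110", codebook)  # ['a', 'b', 'c', 'a'] -- never any ambiguity

This greedy, bit-at-a-time decoding is exactly what fails for {0,1,10,11}: on reading a "1", the decoder could not tell whether to emit the symbol for "1" or keep reading toward "10" or "11".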

Examples of prefix codes are the variable-length Huffman codes, country calling codes, ISBNs and the Secondary Synchronization Codes used in the UMTS W-CDMA 3G Wireless Standard.

This article is partly derived from Federal Standard 1037C, which uses the term comma-free code.


Prefix codes are a form of entropy encoding used in lossless data compression.

When you are reading a newspaper, how do you know where one sentence ends and the next begins? You use a full stop or a question mark, symbols different from any other letter, number, or symbol. How do you know where one word ends and the next begins? You use a full-size space, which looks different from any letter, number, or symbol. How do you know where one letter ends and the next begins? In block printing, the hairline space between letters separates them. However, this "space" is not necessary -- people are able to read cursive handwriting and shorthand even though one letter runs right into the next.

Many communication systems send information as a series of bits. Many storage systems store information as a series of bits. In some systems, the letter 'a' is transmitted as the sequence

   10000110

while the letter 'd' is transmitted as the sequence

   00100110

and the full-sized space between words is

   00000100

Each letter is formed from a sequence of bits -- how can the computer know where one letter ends and the next one starts?

the "comma"

Certainly one could use a special symbol -- analogous to the period at the end of a sentence -- to mark where one letter ends and the next begins. So the word "dada" could be transmitted as

   00100110,10000110,00100110,10000110,

where the "," represents a special symbol, different from "1" or "0". However, modern communication systems send everything as sequences of "1" or "0" -- adding a "third symbol" would be expensive.

(In general, we call the "pause" between items a "comma".)

fixed-length comma-free codes

Fortunately, the "third symbol" turns out to be unnecessary -- it's possible for a machine to receive the comma-free sequence

   00100110100001100010011010000110

and correctly decode the word "dada".

The simplest method is to make every letter the same length -- a "fixed-length code". For example, ISO 8859-15 letters are always 8 bits long. UTF-32/UCS-4 letters are always 32 bits long. ATM cells are always 424 bits long.
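
With a fixed-length code, framing is just chopping the stream into equal-size pieces. A minimal sketch, assuming 8-bit letters and reusing the 'd' and 'a' bit patterns from the example above:

   def decode_fixed(bits, width=8):
       """Split a bit string into fixed-width code words."""
       return [bits[i:i + width] for i in range(0, len(bits), width)]

   decode_fixed("0010011010000110")  # ['00100110', '10000110'] -- 'd' then 'a'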

Often we wish a message took less time to send or less space to store. So we use data compression.

One kind of data compression is to use a different code -- one that uses fewer bits per letter.

If one uses ASCII, for example, one really needs only 7 bits per letter (assuming the standard English alphabet with upper and lower case and no accents).

variable-length codes with a comma

One can compress typical text into even fewer bits if one uses a code with a variable number of bits per letter.

If we used a custom code

 0 a
 1 d
 01 space

then the phrase "add a dad" could be compressed to

 0,1,1,01,0,01,1,0,1,

Morse code is an example of a variable-length code with a comma. The long spaces between letters, and even longer spaces between words, help people recognize where one letter/word ends, and the next begins.

Unfortunately, if we remove the commas, the resulting message

 01101001101

is ambiguous. Does a "0" followed by a "1" represent a space character, or two different letters?

The ambiguity is caused because one complete code (in this case "0" for "a") is just the first part -- the prefix -- of another code (in this case, "01" for space).
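
One way to see the ambiguity concretely is to enumerate every possible parse of the message. A sketch, using the custom code above:

   def all_parses(bits, codebook):
       """Return every way the bit string can be split into code words."""
       if not bits:
           return [[]]
       parses = []
       for word, letter in codebook.items():
           if bits.startswith(word):
               for rest in all_parses(bits[len(word):], codebook):
                   parses.append([letter] + rest)
       return parses

   codebook = {"0": "a", "1": "d", "01": " "}
   all_parses("01", codebook)  # [['a', 'd'], [' ']] -- already two readings
   # "01101001101" has many parses; "add a dad" is only one of them

With a prefix code, this list never has more than one entry.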

variable-length comma-free codes

It is possible to specially design a variable-length code such that there is never any ambiguity. Such a specially designed code is called a "prefix code" or a "prefix-free code".

There are many possible prefix codes for any given set of letters.

When compressing data, we wonder: which one is the best code? (Which code compresses the file into the fewest bits?) In other words, which one gives the best entropy encoding?

If one knows ahead of time all the letters that could possibly be used, and has a good estimate of the letter frequencies, the best possible comma-free code is a Huffman code. (Usually the Huffman process generates a variable-length code. But when all the letters have the same frequency -- as in previously compressed or encrypted data -- and the number of codewords is additionally a power of 2, the Huffman process generates a fixed-length code.)

All other codes (both variable-length and fixed-length) use at least as many bits as a Huffman code. (Usually there are several Huffman codes. All of them compress the file into exactly the same number of bits.)
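
A minimal sketch of the Huffman process itself (the letter frequencies here are invented for illustration):

   import heapq

   def huffman_code(freqs):
       """Build a prefix code from {letter: frequency} with Huffman's algorithm."""
       # Each heap entry: (total frequency, tie-breaker, {letter: code so far})
       heap = [(f, i, {letter: ""}) for i, (letter, f) in enumerate(freqs.items())]
       heapq.heapify(heap)
       count = len(heap)
       while len(heap) > 1:
           f1, _, c1 = heapq.heappop(heap)  # the two least-frequent subtrees...
           f2, _, c2 = heapq.heappop(heap)  # ...are merged under a new parent
           merged = {letter: "0" + code for letter, code in c1.items()}
           merged.update({letter: "1" + code for letter, code in c2.items()})
           heapq.heappush(heap, (f1 + f2, count, merged))
           count += 1
       return heap[0][2]

   huffman_code({"a": 4, "d": 3, " ": 2})  # {'a': '0', ' ': '10', 'd': '11'}

Frequent letters end up with short code words, rare letters with long ones -- which is exactly why the result compresses well.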

non-codes

Some data compression algorithms can compress files into even fewer bits than Huffman compression. Generally this is because they don't use a code at all.

  • They may represent "a" by one pattern of bits in one place in the compressed file, then use that same pattern of bits to represent a completely different letter later on, as in adaptive Huffman compression.
  • They may use a pattern of bits to represent several letters, as in LZW compression -- changing any one of those bits may completely change that block of letters.
  • Or they may avoid mapping particular bits to particular letters (the definition of a code) in other creative ways, as in range encoding.

existence of prefix codes / Kraft's inequality

If we have a fixed number of symbols <math>|X|</math> (the alphabet size), then for any given list of codeword lengths <math>(l_i)_{i=1,\dots,n}</math> a prefix code exists if and only if <math>\sum_{i=1}^{n}|X|^{-l_i}\le 1</math>. This is known as Kraft's inequality.
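
For example, with a binary alphabet (<math>|X| = 2</math>) the lengths (1, 2, 2) give <math>2^{-1} + 2^{-2} + 2^{-2} = 1 \le 1</math>, so a prefix code such as {0, 10, 11} exists; the lengths (1, 1, 2) give <math>2^{-1} + 2^{-1} + 2^{-2} = 5/4 > 1</math>, so no prefix code can have them. The check is one line of Python:

   def kraft_holds(lengths, alphabet_size=2):
       """Kraft's inequality: a prefix code with these lengths exists iff True."""
       return sum(alphabet_size ** -l for l in lengths) <= 1

   kraft_holds([1, 2, 2])  # True:  {0, 10, 11} works
   kraft_holds([1, 1, 2])  # False: two length-1 words already fill the code tree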

error handling

Many communication systems are not completely error-free. Occasionally there are single-bit errors: a toggled bit, a lost bit, or a gained bit.

With fixed-length codes, an error that toggles a bit causes just that one code to be received in error; all other codes are received OK. However, with variable-length codes, losing or gaining a bit (a framing error) turns the rest of the message into gibberish. (This is why most communication protocols periodically re-synchronize. ASCII over RS-232, for example, uses 20% of its bandwidth re-synchronizing after each character.) See also the Synchronization Link protocol.

With Fibonacci codes and unary codes, all single-bit errors cause one or two erroneous codes, but all other codes are received OK. (These codes are "self-synchronizing").
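
A small demonstration of that self-synchronizing behavior, using a unary code (each value n is sent as n "1" bits followed by a "0"):

   def decode_unary(bits):
       """Unary code: value n is n '1' bits followed by a terminating '0'."""
       values, run = [], 0
       for bit in bits:
           if bit == "1":
               run += 1
           else:
               values.append(run)
               run = 0
       return values

   decode_unary("1100111010")  # [2, 0, 3, 1]
   decode_unary("1000111010")  # [1, 0, 0, 3, 1] -- one toggled bit, but the
                               # values after the error are still received OK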

With most other variable-length codes, any kind of single-bit error turns the rest of the message into gibberish.
