Endianness

From Free net encyclopedia

Endianness generally refers to sequencing methods used in a one-dimensional system (such as writing or computer memory). The two main types of endianness are known as big-endian and little-endian. Systems which exhibit aspects of both conventions are often described as middle-endian. When specifically talking about bytes in computing, endianness is also referred to as byte order.

1 Explanation
2 Endianness in computers
3 Endianness in communications
4 Endianness of date formats
5 Discussion, background, etymology
6 External links

Explanation

When a sequence of small units is used to form a larger ordinal value, convention must establish the order in which those smaller units are placed. This could be considered similar to the situation in different written languages, where some (such as English and French) are written left to right, while others (such as Arabic and Hebrew) are written right to left.

Decimal numbers are big-endian when written with digits: they start at the left with the highest order of magnitude and progress to smaller orders of magnitude toward the right. For example, the number 1234 starts with the thousands (1) and continues through the hundreds (2) and tens (3) to the units (4).

Endianness in computers

There seem to be no significant advantages in using one method of endianness over the other, and both have remained common. Generally the byte (octet) is considered an atomic unit from the point of view of storage at all but the lowest levels of network protocols and storage formats. Therefore sequences based around single bytes (e.g., text in ASCII or one of the ISO-8859-n encodings) are not generally affected by endian issues. Variable-width text encodings that use the byte as their base unit could be considered to have an inbuilt endianness, but in all commonly used text encodings this is fixed by the encoding's design. However, Unicode strings encoded with UTF-16 or UTF-32 are affected by endianness, because each code unit must be further represented as two or four bytes.

Logical and arithmetical description

Note: all data values in this section, such as 4A3B2C1D and its individual bytes, are in hexadecimal notation.

When some computers store a 32-bit integer value in memory, for example 4A3B2C1D at address 100, they store the bytes within the address range 100 through 103 in the following order:

Big-endian

	`100`	`101`	`102`	`103`
`...`	`4A`	`3B`	`2C`	`1D`	`...`

That is, the most significant byte (also known as the MSB, which is 4A in our example) is stored at the memory location with the lowest address; the next byte in significance, 3B, is stored at the next memory location, and so on.

Architectures that follow this rule are called big-endian (mnemonic: "big end first") and include Motorola 68000, SPARC and System/370.

Other computers store the value 4A3B2C1D in the following order:

Little-endian

	`100`	`101`	`102`	`103`
`...`	`1D`	`2C`	`3B`	`4A`	`...`

That is, the least significant ("littlest") byte (also known as the LSB) is stored first. Architectures that follow this rule are called little-endian (mnemonic: "little end first") and include the MOS Technology 6502, Intel x86 and DEC VAX.

In other words, endianness does not denote what the value ends with when stored in memory, but rather which end it begins with.
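The two layouts above can be reproduced on any host by writing the bytes out explicitly. The following is a minimal sketch, not taken from any particular library; the function names store_be32 and store_le32 are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

/* Store a 32-bit value with the most significant byte at the
   lowest address (big-endian layout). */
void store_be32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Store a 32-bit value with the least significant byte at the
   lowest address (little-endian layout). */
void store_le32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)v;
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v >> 16);
    out[3] = (uint8_t)(v >> 24);
}
```

Because the functions use shifts rather than pointer casts, they produce the same byte sequence whatever the byte order of the machine that runs them.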

Note that the stated mnemonics are not the origin of the terms; see the section on etymology below.

Some architectures can be configured either way; these include ARM, PowerPC (but not the PPC970/G5), DEC Alpha, MIPS, PA-RISC and IA64. The word bytesexual or bi-endian, said of hardware, denotes willingness to compute or pass data in either big-endian or little-endian format (depending, presumably, on a mode bit somewhere). Many of these architectures can be switched via software to default to a specific endian format (usually done when the computer starts up); however, on some architectures the default endianness is selected by some hardware on the motherboard and cannot be changed by software (e.g., the DEC Alpha, which runs only in big-endian mode on the Cray T3E).

Middle-endian

Still other architectures, called middle-endian (or sometimes mixed-endian), may have a more complicated ordering such that bytes within a 16-bit unit are ordered differently from the 16-bit units within a 32-bit word. For instance, 4A3B2C1D is stored as:

	`100`	`101`	`102`	`103`
`...`	`3B`	`4A`	`1D`	`2C`	`...`

or alternatively:

	`100`	`101`	`102`	`103`
`...`	`2C`	`1D`	`4A`	`3B`	`...`

Middle-endian architectures include the PDP-11 family of processors. (The term pdp-endian is still sometimes used to refer specifically to the PDP-11's endianness.) The format for double-precision floating-point numbers on the VAX and ARM is also middle-endian. In general, these complex orderings are more confusing to work with than consistent big or little endianness.
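As an illustration, the first middle-endian layout shown above (3B 4A 1D 2C) results from placing the more significant 16-bit unit first while storing the bytes inside each unit least significant byte first. The sketch below assumes exactly that rule; the function name store_mixed32 is hypothetical.

```c
#include <stdint.h>

/* Store a 32-bit value as two 16-bit units: the high unit comes
   first, but the bytes inside each unit are little-endian. */
void store_mixed32(uint8_t out[4], uint32_t v)
{
    uint16_t hi = (uint16_t)(v >> 16);      /* 4A3B in the example */
    uint16_t lo = (uint16_t)(v & 0xFFFFu);  /* 2C1D in the example */

    out[0] = (uint8_t)(hi & 0xFFu);  /* low byte of high unit  */
    out[1] = (uint8_t)(hi >> 8);     /* high byte of high unit */
    out[2] = (uint8_t)(lo & 0xFFu);  /* low byte of low unit   */
    out[3] = (uint8_t)(lo >> 8);     /* high byte of low unit  */
}
```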

Endianness also applies in the numbering of bits within a byte or word. In a consistently big-endian architecture the bits in the word are numbered from the left, bit zero being the most significant bit and bit 7 being the least significant bit in a byte. The favored bit endianness depends somewhat on where the computer users expect the binary point to be located in a number. It seems most intuitive to number the bits in the little-endian order if the byte is taken to represent an integer. In this case the bit number corresponds to the exponent of the numeric weight of the bit. However, if the byte is taken to represent a binary fraction, with the binary point to the left of the most significant bit, then the big-endian numbering convention is more convenient.
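The relation between bit number and numeric weight for an integer byte can be written down directly. This is a small sketch for 8-bit bytes only; both function names are hypothetical.

```c
/* Little-endian bit numbering: bit 0 is the least significant bit,
   so bit n has weight 2 to the power n. Valid for n in 0..7. */
unsigned bit_weight_le(unsigned n)
{
    return 1u << n;
}

/* Big-endian bit numbering: bit 0 is the most significant bit,
   so bit n has weight 2 to the power (7 - n). Valid for n in 0..7. */
unsigned bit_weight_be(unsigned n)
{
    return 1u << (7u - n);
}
```

Under little-endian numbering the bit number is itself the exponent, which is the property the paragraph above describes as most intuitive for integers.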

C function to check whether a system is big-endian or little-endian (it assumes int is larger than char and cannot detect a middle-endian system):

#define LITTLE_ENDIAN 0
#define BIG_ENDIAN    1

int machineEndianness(void)
{
   int i = 1;
   unsigned char *p = (unsigned char *) &i; /* View the int as raw bytes */
   if (p[0] == 1) // Lowest address contains the least significant byte
      return LITTLE_ENDIAN;
   else
      return BIG_ENDIAN;
}

Portability issues

Endianness has serious implications for software portability. For example, when a program reads data stored in binary format into a multi-byte integer and applies a bitmask to it, the result depends on the endianness: the same mask selects different bytes of the stored data on big-endian and little-endian hosts.

Writing binary data to a common format also requires care over endianness. For example, the BMP bitmap format requires little-endian integers; if the data are written using big-endian integers, the file will be corrupt because it does not match the format.

Software that needs to share information between hosts of different endianness typically uses one of two strategies. Either it can choose a single endianness for sharing data, or it can allow hosts to share data in any endianness that they choose, so long as they mark which one they are using. Both approaches have advantages. On the one hand, choosing a single endianness makes decoding easier, since software only needs to decode one format. On the other hand, allowing multiple endiannesses makes encoding easier, since software doesn't need to convert data out of its native order; it also enables more efficient communication when the encoder and decoder share a single endianness, since neither needs to change the byte order. Most Internet standards take the first approach, and specify big-endian byte order. Many vendor-originated formats simply use the byte order of the platform they originated on. Some other applications, notably X11, take the second approach.
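The second strategy, in which the sender marks the byte order it used, might be sketched as follows. The message layout (one marker byte followed by a 32-bit value), the marker values 'B' and 'l', and the function name decode_tagged_u32 are all assumptions made for this illustration and do not describe any specific protocol.

```c
#include <stdint.h>

#define ORDER_BIG    'B'  /* hypothetical marker: value follows big-endian    */
#define ORDER_LITTLE 'l'  /* hypothetical marker: value follows little-endian */

/* Decode a 5-byte message: msg[0] is the byte order marker and
   msg[1..4] hold a 32-bit value in the order the sender chose.
   Returns 0 on success, -1 if the marker is not recognised. */
int decode_tagged_u32(const uint8_t msg[5], uint32_t *out)
{
    const uint8_t *p = msg + 1;

    if (msg[0] == ORDER_BIG) {
        *out = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
        return 0;
    }
    if (msg[0] == ORDER_LITTLE) {
        *out = ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
               ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
        return 0;
    }
    return -1;
}
```

The sender writes its native order without conversion; the cost of supporting both orders falls entirely on the decoder, which is the trade-off described above.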

UTF-16 can be written in big-endian or little-endian order. It permits a 2-byte Byte Order Mark (BOM) at the beginning of a string to denote its endianness. A similar 4-byte byte order mark can be used with the rare encoding UTF-32.
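A reader can inspect the first two bytes of a UTF-16 string to find its byte order. The BOM is the code point U+FEFF, so it appears as FE FF in big-endian order and FF FE in little-endian order. The sketch below assumes the input is known to be UTF-16; the enum and function names are hypothetical.

```c
#include <stddef.h>

enum utf16_order { UTF16_UNKNOWN, UTF16_BIG, UTF16_LITTLE };

/* Inspect the first two bytes of a buffer for a UTF-16 byte order mark.
   Returns UTF16_UNKNOWN if the buffer is too short or has no BOM.
   Caveat: a UTF-32 little-endian BOM (FF FE 00 00) also starts with
   FF FE, so this check is only meaningful for data known to be UTF-16. */
enum utf16_order utf16_detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return UTF16_BIG;
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return UTF16_LITTLE;
    }
    return UTF16_UNKNOWN;
}
```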

Example programming caveat

Below is an example application, written in C, which demonstrates the dangers of endianness-unaware programming:

#include <stdio.h>
#include <string.h>  /* for strcpy */

int main (int argc, char* argv[])
{
  FILE* fp;

  /* Our example data structure */
  struct {
    char one[4];
    int  two;
    char three[4];
  } data;

  /* Fill our structure with data */
  strcpy (data.one, "foo");
  data.two = 0x01234567;
  strcpy (data.three, "bar");

  /* Write it to a file; the bytes of data.two are written in host order */
  fp = fopen ("output", "wb");
  if (fp)
  {
    fwrite (&data, sizeof (data), 1, fp);
    fclose (fp);
  }

  return 0;
}

This code compiles properly on an i386 machine running FreeBSD and a SPARC64 machine running Solaris, but the output is different when examined with the hexdump utility.

i386 $ hexdump -C output 
00000000  66 6f 6f 00 67 45 23 01  62 61 72 00              |foo.gE#.bar.|
0000000c

sparc64 $ hexdump -C output
00000000  66 6f 6f 00 01 23 45 67  62 61 72 00              |foo..#Egbar.|
0000000c
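One way to obtain identical output on both machines is to serialize each field explicitly instead of writing the structure's memory image. The following is a minimal sketch under stated assumptions: the 12-byte layout mirrors the dumps above, big-endian is chosen arbitrarily for the integer field, and the function name encode_record is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Serialize the example record into a fixed 12-byte layout:
   4 bytes of text, a 32-bit integer in big-endian order,
   then 4 more bytes of text. The result does not depend on
   the byte order of the host. */
void encode_record(uint8_t out[12], const char one[4],
                   uint32_t two, const char three[4])
{
    memcpy(out, one, 4);

    out[4] = (uint8_t)(two >> 24);  /* most significant byte first */
    out[5] = (uint8_t)(two >> 16);
    out[6] = (uint8_t)(two >> 8);
    out[7] = (uint8_t)two;

    memcpy(out + 8, three, 4);
}
```

The resulting buffer can then be passed to fwrite as an array of bytes. With the values used in the example program, every host would produce the byte sequence shown in the sparc64 dump.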

Endianness in communications

In general, the NUXI problem (also called the endian problem) is the problem of transferring data between computers with differing byte order. For example, the string "UNIX", packed with two bytes per 16-bit integer, might look like "NUXI" to a machine with a different "byte sex". The problem is caused by the difference in endianness. The problem was first discovered when porting an early version of Unix from PDP-11 (a middle-endian architecture) to an IBM Series 1 minicomputer (a big-endian architecture); upon startup, the computer output replaced the string "UNIX" with "NUXI".
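The effect is easy to reproduce: swapping the two bytes inside each 16-bit unit of "UNIX" yields "NUXI". The sketch below performs that swap on a byte buffer; the function name swap_bytes_in_pairs is hypothetical.

```c
#include <stddef.h>

/* Swap the two bytes inside every 16-bit unit of a buffer, in place.
   If len is odd, the final unpaired byte is left untouched. */
void swap_bytes_in_pairs(char *buf, size_t len)
{
    for (size_t i = 0; i + 1 < len; i += 2) {
        char tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}
```

Applying the function twice restores the original text, which is why the same routine can serve for conversion in either direction between two hosts that differ only in 16-bit byte order.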

The Internet Protocol defines a standard "big-endian" network byte order. This byte order is used for all numeric values in the packet headers and by many higher level protocols and file formats that are designed for use over IP.

The Berkeley sockets API defines a set of functions to convert 16- and 32-bit integers to and from network byte order: the htonl and htons functions convert 32-bit ("long") and 16-bit ("short") values respectively from host to network order; whereas the ntohl and ntohs functions convert from network to host order.
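A typical use of these functions is to convert a value just before copying it into an outgoing buffer, and just after copying it out of an incoming one. The sketch below assumes a POSIX system that provides <arpa/inet.h>; the helper names put_net32 and get_net32 are hypothetical.

```c
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Write a host-order 32-bit value into a buffer in network byte order. */
void put_net32(unsigned char out[4], uint32_t host_value)
{
    uint32_t net = htonl(host_value);
    memcpy(out, &net, sizeof net);
}

/* Read a 32-bit value in network byte order from a buffer and
   return it in host order. */
uint32_t get_net32(const unsigned char in[4])
{
    uint32_t net;
    memcpy(&net, in, sizeof net);
    return ntohl(net);
}
```

On a big-endian host both conversions leave the value unchanged; on a little-endian host they reverse the bytes. Either way the buffer holds the most significant byte first.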

Serial devices also have bit-endianness: the bits in a byte can be sent little-endian (least significant bit first) or big-endian (most significant bit first). This decision is made at the very bottom of the data link layer of the OSI model.
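The two transmission orders can be sketched by listing the bits of a byte in the sequence in which they would be sent. This is only an illustration of the ordering, not of any particular serial interface; both function names are hypothetical.

```c
#include <stdint.h>

/* Fill bits[0..7] with the bits of a byte in sending order,
   least significant bit first. */
void bits_lsb_first(uint8_t byte, uint8_t bits[8])
{
    for (int i = 0; i < 8; i++)
        bits[i] = (uint8_t)((byte >> i) & 1u);
}

/* Fill bits[0..7] with the bits of a byte in sending order,
   most significant bit first. */
void bits_msb_first(uint8_t byte, uint8_t bits[8])
{
    for (int i = 0; i < 8; i++)
        bits[i] = (uint8_t)((byte >> (7 - i)) & 1u);
}
```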

Endianness of date formats

Endianness can be illustrated simply by the different ways in which countries format calendar dates.

In the United States and a few other countries, dates are most commonly formatted as Month; Day; Year (e.g., "May 24th, 2006", "5/24/2006"). This is a middle-endian order.

Most of Oceania and Europe (except Sweden, Denmark, Latvia and Hungary, where ISO 8601 is most common) format dates as Day; Month; Year (e.g., "24th May, 2006", "24/5/2006", "24/5-2006", "24.5.06"). This is little-endian.

In many other countries, including China and Japan, use of the ISO 8601 international standard ordering of dates is prevalent: Year; Month; Day (e.g., "2006 May 24th", or, more properly, "2006-05-24"). This is big-endian.

The ISO 8601 ordering scheme lends itself to straightforward computerised sorting of dates in lexicographical order, or dictionary sort order. This means that sorting algorithms do not need to treat the numeric parts of the date string any differently from a string of non-numeric characters, and the dates will be sorted into chronological order. Note that for this to work, years must always be expressed as four digits, months as two, and days as two. Thus single-digit days and months must be padded with a zero yielding '01', '02', ... , '09'.
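This property means an ordinary string comparison is enough to sort such dates. The sketch below assumes every date is a zero-padded "YYYY-MM-DD" string as described above; the function names are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: each element is a pointer to a C string. */
static int compare_strings(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/* Sort an array of zero-padded "YYYY-MM-DD" strings. Plain
   lexicographic comparison puts them in chronological order. */
void sort_iso_dates(const char *dates[], size_t count)
{
    qsort(dates, count, sizeof dates[0], compare_strings);
}
```

The same comparator applied to unpadded or differently ordered formats such as "24/5/2006" would not yield chronological order, since the most significant field would no longer come first.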

Discussion, background, etymology

Big-endian numbers are easier to read when debugging a program. Some think they are less intuitive because the most significant byte is at the smaller address. Some think they are less confusing because the significance order is the same as the order of normal textual character strings in the computer, just as in non-computer text (see below). A person's preference is usually based both on which convention was studied first and on which convention the person's mental models were built around.

The choice of big-endian vs. little-endian is as arbitrary as the entire concept, and has been the subject of flame wars. Emphasizing the futility of this argument, the very terms big-endian and little-endian were taken from the Big-Endians and Little-Endians of Jonathan Swift's satiric novel Gulliver's Travels, where, in Lilliput and Blefuscu, Gulliver finds two factions in a nearly religious conflict over which end of an egg to crack.

See the Endian FAQ, including the significant essay "On Holy Wars and a Plea for Peace" by Danny Cohen (1980).

The Hindu-Arabic numeral system is used worldwide and is such that the most significant digits are always written to the left of the less significant ones. Writing left to right, this system is therefore big-endian. Writing right to left, this numeral system is little-endian. It is worth noting, however, that in quite a few languages numerical order is inconsistent with how numbers appear written and in some languages, such as Hebrew, it is common to interrupt the writing of text (right-to-left) to write a number in the opposite order (left-to-right).

Little-endian ordering has been used in compiling reverse dictionaries, where the entries begin, for example, with "a, aa, baa, ..." and end, for example, with "... buzz, abuzz, fuzz." An actual example is a pronouncing dictionary for Cantonese (ISBN 9629485095) which begins with “a, ba, da, dza,…” and ends with “…, tyt, tsyt, m̩, ŋ̩”.

There is some confusion over how the word endianness should be spelled. The two major variants are endianness and endianess, and some documents even contain both. While neither form appears in current (non-computing) dictionaries, the former follows the pattern of similar words such as "barren" and "barrenness". Thus, endianness is more widely accepted.
