This article lays the foundation for further articles on writing binary files.

So, what exactly do I mean by byte ordering? What is this endianness?

Byte ordering

When writing down a number, for example 153, we compose it of multiple digits, in this case 1, 5 and 3. But knowing these digits isn’t enough. Their position in the number, their ordering, is important too. While the 3 alone has a higher value than the 1, in our example the 1 means one hundred because it appears in the first of three positions. In other words, it is the most significant digit, while the 3 is the least significant digit.

The memory of a computer is composed of bytes, each of which can hold a number between 0 and 255. If you want to store bigger numbers, you just put several bytes together. Two bytes (16 bits) can hold a value between 0 and 65535. But here is the interesting thing: in which order do you store those two bytes? Should the most significant byte come first in memory, the way we write down our numbers?

There is no right answer to this question, as both approaches were taken on different processors. Those that store the most significant byte first are called big endian, and those that store the least significant byte first are called little endian. In the past there were even some mixed endian systems with a rather odd byte layout, but luckily they didn’t catch on, so we ignore them.
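To make the two layouts concrete, here is a minimal sketch that copies a 16 bit value into a byte array and looks at the order in which the bytes sit in memory. The names bytesOf16 and storesLeastSignificantByteFirst are made up for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the two bytes of a 16-bit value
// in the order in which they sit in memory on this machine.
std::array<uint8_t, 2> bytesOf16(uint16_t value)
{
    std::array<uint8_t, 2> bytes{};
    std::memcpy(bytes.data(), &value, sizeof(value));
    return bytes;
}

// Usage: exactly one of the two layouts applies to 0x1234.
bool storesLeastSignificantByteFirst()
{
    const std::array<uint8_t, 2> bytes = bytesOf16(0x1234);
    const bool little = (bytes[0] == 0x34 && bytes[1] == 0x12);
    const bool big    = (bytes[0] == 0x12 && bytes[1] == 0x34);
    assert(little != big); // with only two bytes there is no third layout
    return little;
}
```

For the value 0x1234 a little endian machine stores 0x34 followed by 0x12, while a big endian machine stores 0x12 followed by 0x34.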

Today most personal computers use little endian because it is the endianness of the widespread Intel x86 architecture. Keep this in mind if you ever analyze raw byte data: the bytes might be in a different order than you first expect.

If you want to read more on this topic, look at the Wikipedia article on endianness; it goes into a lot more detail.

Why is this important for binary file formats?

The ordering of the bytes in memory usually doesn’t matter. Your calculations will work and you don’t need to think about it. Problems arise when you send data from one computer to another with a different byte ordering. This happens a lot with internet traffic and file formats, because both are designed for sharing data between different computers.

So we need to deal with different byte orderings. How? There are two solutions:

1. Allow both orderings and detect while reading which one was used. This way you can always write with your own ordering, but when you read a file you might need to correct the ordering. For detection, many formats use a marker such as the BOM (byte order mark) of Unicode files.
2. Simply specify which byte ordering the format uses. Sounds simple, and it is! Indeed, most network protocols took this approach and decided that everything sent should be in big endian. Many of the file formats we will look at decided to use only little endian. In this scenario we may need to convert the data on writing and reading if our system uses the “wrong” ordering.
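As an illustration of the marker idea, here is a small sketch that inspects the first two bytes of a buffer for the UTF-16 byte order mark. It assumes the data is already known to be UTF-16; the ByteOrder enum and the function names are hypothetical and only serve this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class ByteOrder { Unknown, Little, Big };

// Hypothetical helper: the byte order mark U+FEFF is stored as
// FE FF in big endian and as FF FE in little endian UTF-16.
ByteOrder detectUtf16ByteOrder(const uint8_t* data, std::size_t size)
{
    if (size < 2) return ByteOrder::Unknown;
    if (data[0] == 0xFE && data[1] == 0xFF) return ByteOrder::Big;
    if (data[0] == 0xFF && data[1] == 0xFE) return ByteOrder::Little;
    return ByteOrder::Unknown; // no mark present
}

// Usage with a made-up buffer: little endian mark followed by "A".
void exampleBomDetection()
{
    const uint8_t data[] = {0xFF, 0xFE, 0x41, 0x00};
    assert(detectUtf16ByteOrder(data, sizeof(data)) == ByteOrder::Little);
}
```

A reader would call such a function once at the start of the file and then swap every multibyte value only if the detected order differs from its own.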

To recap, we only need to correct data if both of the following are true:

1. We have multibyte data. If everything uses just a single byte, no correction is needed.
2. We detected that our system’s byte order doesn’t match the required order.

How do we detect the ordering?

I will use C++ for this article’s code.

So first, let’s write functions which allow us to detect the byte order:

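#include <cstdint>

// Both members share the same two bytes of memory.
union _endian_test {
    uint16_t word;
    uint8_t byte[2];
};

// Returns true if the least significant byte is stored first.
// Note: reading a union member other than the one written last is
// allowed in C but formally undefined in C++; common compilers support it.
bool isLittleEndian() {
    _endian_test test;
    test.word = 0xAA00;

    return test.byte[0] == 0x00;
}

// Returns true if the most significant byte is stored first.
bool isBigEndian() {
    _endian_test test;
    test.word = 0xAA00;

    return test.byte[0] == 0xAA;
}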

Whoa, what is happening in this snippet? The most important part is the union at the beginning. A union is a bit like a struct, with one important difference: every member uses exactly the same memory. Unions are a bit special and not as generally useful as structs or classes. But for this task they shine!

Let’s concentrate on isLittleEndian(): we first create a new _endian_test and set the 16 bit unsigned int word to 0xAA00. This value was picked arbitrarily; any other value would have worked as long as its first byte differs from its second. Because one byte is represented by exactly two hexadecimal digits, hexadecimal notation is easy to work with when the ordering of the bytes matters more than the content itself.

And now the magic: if test.byte[0] == 0x00 is true, we know we are on a little endian system. Why? Because we use a union, test.byte[0] uses the exact same memory as the first byte of test.word. test.byte[1] would have used the same memory as the second byte of test.word.

If this first byte is 0x00 and we put 0xAA00 in word, this means that the first byte corresponds to the least significant byte.

In isBigEndian() we use the same logic but check if test.byte[0] == 0xAA which would mean that the first byte corresponds to the most significant byte.
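Reading a union member other than the one written last is allowed in C, but the C++ standard leaves it undefined, even though common compilers support it. As a hedged alternative, the same test can be sketched with std::memcpy, which copies the bytes of the value into a separate array. The function names below are made up for this example. If you can rely on C++20, std::endian from the <bit> header answers the same question at compile time.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Alternative sketch: copy the bytes of the value instead of
// reading them through a union.
bool isLittleEndianMemcpy()
{
    const uint16_t word = 0xAA00;
    uint8_t bytes[2];
    std::memcpy(bytes, &word, sizeof(word));
    return bytes[0] == 0x00; // least significant byte comes first
}

bool isBigEndianMemcpy()
{
    const uint16_t word = 0xAA00;
    uint8_t bytes[2];
    std::memcpy(bytes, &word, sizeof(word));
    return bytes[0] == 0xAA; // most significant byte comes first
}

// Usage: on a machine with two-byte uint16_t exactly one check is true.
void exampleDetection()
{
    assert(isLittleEndianMemcpy() != isBigEndianMemcpy());
}
```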

And how do we correct the order?

To correct the order, we just need to swap the bytes. The important thing is that you don’t do any arithmetic, such as adding or subtracting, on a value while its byte order differs from that of your system. Swap first, then calculate.

Let’s look at a simple byte swap for 16 bit integers:

uint16_t swap16(uint16_t in)
{
    // The shifts are computed as int; the cast keeps only the low 16 bits.
    return static_cast<uint16_t>((in >> 8) | (in << 8));
}

This is short. We shift all bits of the input by 8 positions, once to the right and once to the left, and then combine the two results with a bitwise OR. Shift operations on unsigned integers always shift zeros in. For signed integers a right shift may copy the sign bit instead, so we always cast them to unsigned before swapping.

This is what’s happening step by step:

0xFF03 <=> 1111 1111 0000 0011
1111 1111 0000 0011 >> 8 => 0000 0000 1111 1111
1111 1111 0000 0011 << 8 => 0000 0011 0000 0000
0000 0000 1111 1111 OR 0000 0011 0000 0000 => 0000 0011 1111 1111
0000 0011 1111 1111 <=> 0x03FF
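The worked example and the remark about signed integers can be checked with a short sketch. swap16Signed is a hypothetical helper that is not part of the article’s code, and the conversion back to int16_t assumes the usual two’s complement behavior.

```cpp
#include <cassert>
#include <cstdint>

uint16_t swap16(uint16_t in)
{
    return static_cast<uint16_t>((in >> 8) | (in << 8));
}

// Hypothetical helper for signed values: swap the unsigned
// representation, then convert back (assumes two's complement).
int16_t swap16Signed(int16_t in)
{
    return static_cast<int16_t>(swap16(static_cast<uint16_t>(in)));
}

// Usage: the value from the step by step example above.
void exampleSwap16()
{
    assert(swap16(0xFF03) == 0x03FF);
    assert(swap16(swap16(0xFF03)) == 0xFF03); // swapping twice restores the input
}
```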

I’ve also implemented this swapping for 32 bit and 64 bit, but I include those versions without explanation. The idea is really the same, there are just more bytes. For 64 bit I’ve reused swap32 to avoid tons of zeros in the source. This way I save some error-prone repetition. Try to work out for yourself what is happening.

uint32_t swap32(uint32_t in)
{
    return  ((in & 0xFF000000) >> 24) | ((in & 0x00FF0000) >>  8) | 
            ((in & 0x0000FF00) <<  8) | ((in & 0x000000FF) << 24);
}

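uint64_t swap64(uint64_t in)
{
    // Swap each 32-bit half on its own, then exchange the two halves.
    // Shifting the 64-bit input avoids a 32-bit wide mask on platforms
    // where long has only 32 bits.
    uint64_t a = swap32(static_cast<uint32_t>(in & 0xFFFFFFFF)); // low half
    uint64_t b = swap32(static_cast<uint32_t>(in >> 32));        // high half

    return (a << 32) | b;
}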
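If you would rather not write one function per width, a generic version can be sketched by reversing the bytes of the value in a temporary array. This is a different technique from the shift and mask functions above, and swapBytes is a name made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical generic swap: copies the value into a byte array,
// reverses the array and copies it back.
template <typename T>
T swapBytes(T in)
{
    static_assert(std::is_unsigned<T>::value, "use an unsigned integer type");
    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, &in, sizeof(T));
    std::reverse(bytes, bytes + sizeof(T));
    std::memcpy(&in, bytes, sizeof(T));
    return in;
}

// Usage: works for any width without writing out masks.
void exampleSwapBytes()
{
    assert(swapBytes<uint32_t>(0x11223344u) == 0x44332211u);
}
```

The shift based versions are usually easier for a compiler to turn into a single instruction, so treat this as a readable fallback rather than a replacement.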

So now we can detect the endianness of the system and we can swap bytes. Great! This is really everything we need. Most of the time, at least when writing data, we know what the endianness of the data should be, but we don’t know our own endianness. So we should write helpers for that job which swap the bytes only if our endianness differs from the one we want.

uint16_t toLittleEndian16(uint16_t in) 
{
    return isLittleEndian() ? in : swap16(in);
}

uint16_t toBigEndian16(uint16_t in)
{
    return isBigEndian() ? in : swap16(in);
}
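For comparison, here is a sketch of a different technique that needs no detection at all: when a value is split into bytes with shifts, the result depends only on the value and not on how the machine stores it. This is not the approach of this article, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends a 16-bit value in little endian order.
// Shifts work on the value itself, so the host byte order is irrelevant.
void appendLittleEndian16(std::vector<uint8_t>& out, uint16_t value)
{
    out.push_back(static_cast<uint8_t>(value & 0xFF));        // least significant byte
    out.push_back(static_cast<uint8_t>((value >> 8) & 0xFF)); // most significant byte
}

// Same idea for big endian: most significant byte first.
void appendBigEndian16(std::vector<uint8_t>& out, uint16_t value)
{
    out.push_back(static_cast<uint8_t>((value >> 8) & 0xFF));
    out.push_back(static_cast<uint8_t>(value & 0xFF));
}

// Usage: build a small buffer that could then be written to a file.
void exampleAppend()
{
    std::vector<uint8_t> buffer;
    appendLittleEndian16(buffer, 0x1234);
    assert(buffer.size() == 2 && buffer[0] == 0x34 && buffer[1] == 0x12);
}
```

The detect and swap helpers are handy when whole structures are already in memory, while the shift based variant is convenient when you assemble a file byte by byte.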

Nothing really special. I’ve left out the four functions for 32 bit and 64 bit, but you can find them on GitHub: I’ve bundled the code of this article as a single-header library (SHL) and uploaded it as nh_byteorder.hpp.

NHollmann

Byte order: Big Endian vs. Little Endian
