bitstring.Array('uintle12')? #354

@fdwr

TLDR

😎 Nice library. 💡 Supporting bitstring.Array("uintle12") for my data array would be useful. 🙏

Version: bitstring==4.3.0
Urgency: non-blocking, as I figured out an inefficient workaround to swap endianness BE<->LE manually, and although it has some limitations and is not fully general, my data happens to align nicely to those constraints.

Problem

I have some 12-bit graphics data stored in little-endian layout (other occurrences in the wild include FAT-12 tables on floppy disk images), and I tried this library on it yesterday. Alas, bitstring.Array appears to support only big-endian layout for 12-bit data 🤔. The raw bytes that bitstring.Array's tobytes returns for "uint12" on my x86 machine are clearly big endian (like TCP/IP field layout): the first element's 8 MSBs are stored in byte[0], its 4 LSBs in the high nibble of byte[1], then the second element's 4 MSBs in the low nibble of byte[1], and its 8 LSBs in byte[2]:

Desired little-endian element layout 🙂:

Absolute bit index:  00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 ...
Dword:               [---------------------------------------------00----------------------------------------------] ...
Byte:                [---------00----------] [---------01----------] [---------02----------] [---------03----------] ...
Bit in byte:         00 01 02 03 04 05 06 07 00 01 02 03 04 05 06 07 00 01 02 03 04 05 06 07 00 01 02 03 04 05 06 07 ...

Element index:       [---------------00----------------] [---------------01----------------] [---------------02----- ...
Bit of element:      00 01 02 03 04 05 06 07 08 09 10 11 00 01 02 03 04 05 06 07 08 09 10 11 00 01 02 03 04 05 06 07 ...

Actual element layout of "uint12", which is big-endian 🙃:

Absolute bit index:  00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 ...
Dword:               [---------------------------------------------00----------------------------------------------] ...
Byte:                [---------00----------] [---------01----------] [---------02----------] [---------03----------] ...
Bit in byte:         00 01 02 03 04 05 06 07 00 01 02 03 04 05 06 07 00 01 02 03 04 05 06 07 00 01 02 03 04 05 06 07 ...

Element index:       (---------00----------] (---01----] [---00----) [---------01----------) (---------02----------] ...
Bit of element:      04 05 06 07 08 09 10 11 08 09 10 11 00 01 02 03 00 01 02 03 04 05 06 07 04 05 06 07 08 09 10 11 ...
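
To make the two layouts concrete, here is a minimal pure-Python sketch that packs the same two 12-bit values both ways. It does not use bitstring; `pack_be` and `pack_le` are hypothetical helper names invented for illustration. The big-endian result follows the byte layout described above for "uint12", and the little-endian result is what I would expect a "uintle12" dtype to produce.

```python
def pack_be(values, width):
    """Pack unsigned ints MSB-first (the layout described for "uint12")."""
    mask = (1 << width) - 1
    acc = 0
    for v in values:
        acc = (acc << width) | (v & mask)
    nbits = len(values) * width
    pad = -nbits % 8  # zero bits appended to reach a whole byte
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")


def pack_le(values, width):
    """Pack unsigned ints LSB-first (the requested "uintle12" layout)."""
    mask = (1 << width) - 1
    acc = 0
    for i, v in enumerate(values):
        acc |= (v & mask) << (i * width)
    nbits = len(values) * width
    return acc.to_bytes((nbits + 7) // 8, "little")


print(pack_be([0xABC, 0xDEF], 12).hex())  # abcdef
print(pack_le([0xABC, 0xDEF], 12).hex())  # bcfade
```

In the little-endian output, byte[0] holds the low 8 bits of the first element, and byte[1] holds its top nibble in the low half and the second element's low nibble in the high half.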

Tried

  • bitstring.Array("uintle12") yields ValueError: Inappropriate Dtype for Array: 'uintle12'
  • bitstring.Array("<uint12") yields ValueError: Inappropriate Dtype for Array: '<uint12'
  • bitstring.options.lsb0 = True just seems to reverse the element direction while still keeping the actual layout across bytes in BE.

Feature request

Please add dtypes for uintle12 (plus uintbe12 for symmetry as an alias of the current uint12) when/if you have time.

Additionally I have image data in LE 2-bits-per-pixel and 4-bits-per-pixel that would be nice to work with, but my 4bpp image array...

Image

...instead looks like:

Image

Supporting "uintle4" and "uintle2" would remedy that:

Image -> Image

Mathematically for 2bpp:

pixel[0] = (byte[0] >> 0) & 0x03 // bits 0..2
pixel[1] = (byte[0] >> 2) & 0x03 // bits 2..4
pixel[2] = (byte[0] >> 4) & 0x03 // bits 4..6
pixel[3] = (byte[0] >> 6) & 0x03 // bits 6..8
pixel[4] = (byte[1] >> 0) & 0x03 // bits 8..10
pixel[5] = (byte[1] >> 2) & 0x03 // bits 10..12
pixel[6] = (byte[1] >> 4) & 0x03 // bits 12..14
pixel[7] = (byte[1] >> 6) & 0x03 // bits 14..16
...
pixel[i] = (byte[i >> 2] >> (i * 2 & 0x07)) & 0x03
______________________________________________________________________________________________________________

Absolute bit index:  00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 ...
Dword:               [---------------------------------------------00----------------------------------------------] ...
Byte:                [---------00----------] [---------01----------] [---------02----------] [---------03----------] ...
Bit in byte:         00 01 02 03 04 05 06 07 00 01 02 03 04 05 06 07 00 01 02 03 04 05 06 07 00 01 02 03 04 05 06 07 ...

Element index:       [00 ] [01 ] [02 ] [03 ] [04 ] [05 ] [06 ] [07 ] [08 ] [09 ] [10 ] [11 ] [12 ] [13 ] [14 ] [15 ] 
Bit of element:      00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 00 01 ...
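
The indexing formula above translates directly to Python. This is only a sketch of the arithmetic, not anything bitstring provides; `unpack_le_pixels` is a hypothetical name, and it assumes a pixel size that divides 8 evenly, so no pixel straddles a byte boundary.

```python
def unpack_le_pixels(data, bpp):
    """Unpack sub-byte pixels where lower-index pixels sit in lower bits."""
    if bpp not in (1, 2, 4, 8):
        raise ValueError("bpp must divide 8 evenly")
    per_byte = 8 // bpp
    mask = (1 << bpp) - 1
    # For bpp == 2 this is pixel[i] = (byte[i >> 2] >> (i * 2 & 0x07)) & 0x03.
    return [
        (data[i // per_byte] >> ((i % per_byte) * bpp)) & mask
        for i in range(len(data) * per_byte)
    ]


print(unpack_le_pixels(bytes([0b11100100]), 2))  # [0, 1, 2, 3]
print(unpack_le_pixels(bytes([0x21, 0x43]), 4))  # [1, 2, 3, 4]
```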

Note

While older EGA/VGA-based images used BE layout (where the low-index pixel was in the high-index bits, and the high-index pixel was in the low-index bits), newer formats on video game consoles and machine learning tensors follow the convention that higher-index elements are stored in higher bits.

This actually applies to any element bit size that isn't a multiple of 8, where BE (TCP/IP) fills from MSB to LSB, and LE (x86) fills bits from LSB to MSB, meaning that even oddities like uintle3 or uintle5 should work consistently too, but for my needs, 2-bit, 4-bit, and 12-bit are most important.
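
As a rough illustration of the "fills bits from LSB to MSB" rule for arbitrary widths, the sketch below treats the whole buffer as one little-endian integer and slices elements off from the low end, so elements that straddle byte boundaries need no special casing. `unpack_le` is a hypothetical helper and not part of bitstring; it trades efficiency for brevity.

```python
def unpack_le(data, width, count=None):
    """Unpack `count` unsigned ints of `width` bits, filled LSB to MSB."""
    acc = int.from_bytes(data, "little")
    if count is None:
        count = (len(data) * 8) // width  # ignore any fractional trailer
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(count)]


print(unpack_le(bytes.fromhex("bcfade"), 12))  # [2748, 3567], i.e. 0xABC, 0xDEF
print(unpack_le(bytes([0x41, 0x0C]), 5))       # [1, 2, 3]
```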

Workarounds

Endianness can be thought of more generically as a mapping between logical bit indices and actual bit indices (not simply "how bytes are arranged in a word"), so it's possible to transform between layouts (LE <-> BE) by reversing the order of all the elements and reversing the order of all the bytes. Currently I call swapBE8toLE8 before serializing back out to the file. It would be nicer to handle this in place via direct "uintle#" support in the array: no extra memory rewrites or copies, no risk of forgetting the swap as the array is passed around to other parts of the program, and no boundary-condition issues when the total bit count is not a multiple of 8.

swapBE8toLE8(outputArray)
...

def swapBE8toLE8(array: bitstring.Array):
    if array.itemsize % 8 == 0:
        # Faster shortcut for byte-multiple element sizes, which work directly
        # (but the "else" branch below would work too).
        array.byteswap()
    else:
        # Slower workaround to swap endianness layout for non-byte multiples.
        # This still has the constraint that the total bit count must be a multiple of 8,
        # because otherwise the byte reversal fails on the fractional trailer,
        # whereas a direct implementation would not have that issue.
        originalDtype = array.dtype
        array.reverse()          # reverse the element order
        array.dtype = "uint8"    # reinterpret the same bits as bytes
        array.reverse()          # reverse the byte order
        array.dtype = originalDtype

def swapLE8toBE8(array: bitstring.Array):
    if array.itemsize % 8 == 0:
        # Faster shortcut for byte-multiple element sizes, which work directly
        # (but the "else" branch below would work too).
        array.byteswap()
    else:
        # Slower workaround to swap endianness layout for non-byte multiples.
        # This still has the constraint that the total bit count must be a multiple of 8,
        # because otherwise the byte reversal fails on the fractional trailer,
        # whereas a direct implementation would not have that issue.
        originalDtype = array.dtype
        array.dtype = "uint8"    # reinterpret the bits as bytes
        array.reverse()          # reverse the byte order
        array.dtype = originalDtype
        array.reverse()          # reverse the element order

Note

These two functions are distinct: in general you can't call swapBE8toLE8 a second time on the same data to undo it, because permuting LE to either BE8 or BE16 (so-called "middle endian") isn't always the same as unpermuting back to little endianness. The two do coincide when the element bit size is a multiple of the minimal address unit size (which is 8 bits on most architectures, or 16 bits on a few oddities like the NUXI PDP-11); in that case calling swapBE8toLE8 twice on the same array restores the original data.
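
The asymmetry is easy to check on raw bytes without bitstring. The sketch below mirrors the two steps of the workaround (reverse the elements, reverse the bytes) in pure Python; `reverse_elements`, `swap_be_to_le`, and `swap_le_to_be` are hypothetical names, and like the workaround it assumes the total bit count is a multiple of both 8 and the element width.

```python
def reverse_elements(data, width):
    """Reverse the order of `width`-bit elements in an MSB-first bit stream."""
    nbits = len(data) * 8
    if nbits % width:
        raise ValueError("total bit count must be a multiple of width")
    acc = int.from_bytes(data, "big")
    mask = (1 << width) - 1
    out = 0
    for _ in range(nbits // width):
        out = (out << width) | (acc & mask)  # move the last element to the front
        acc >>= width
    return out.to_bytes(len(data), "big")


def swap_be_to_le(data, width):
    return reverse_elements(data, width)[::-1]   # elements first, then bytes


def swap_le_to_be(data, width):
    return reverse_elements(data[::-1], width)   # bytes first, then elements


be = bytes.fromhex("abcdef")              # 0xABC, 0xDEF packed big-endian
le = swap_be_to_le(be, 12)
print(le.hex())                           # bcfade
print(swap_le_to_be(le, 12).hex())        # abcdef (round trip restored)
print(swap_be_to_le(le, 12).hex())        # cfebad (applying it twice does not restore)
```

With 16-bit elements the same two functions give identical results, which matches the byte-multiple case described above.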

Important

I've often seen the belief that endianness is purely an architectural hardware trait of how bytes are arranged within a given word unit, but that isn't the complete picture. When you think about units that straddle bytes (and read architectural diagrams for TCP/IP or documents like the GenICam Pixel Format Naming Convention), you realize that endianness indirectly also implies the direction in which bitfields flow within and across each byte. For any reasonable efficiency, without a lot of bit slicing, masking, and OR-ing (especially when reading word units larger than bytes and progressively shifting bits), it makes the most sense for BE architectures to store fields MSB->LSB and for LE architectures to store them LSB->MSB.

Related

🫡 Thanks from Redmond Washington.
