I’m working on a serialization format that’s meant to be used with JavaScript. To keep the size low, numbers that tend to be smaller are going to be implemented with variable-width integers. One issue with JavaScript is that its numbers are implemented quirkily - spec-wise, they’re all 64-bit floating point numbers, but bitwise operations are done as though they’re 32-bit integers. This makes it difficult to work (at a bit level) with numbers that are larger than 32 bits, as you can’t do bitwise operations on them¹. On the bright(?) side, in JavaScript, you’ll often hit a memory or size limit before enumerating something higher than 32 bits. All this to say, variable-width integers in my format are going to be limited to 32 bits at most.
For the implementation of these numbers, I was going to go with the good old LEB128 encoding, but I wanted to see if I could figure out something that would take advantage of the 32-bit integer limit. The format I ended up with is the subject of this post - a variable-width 30-bit integer format that I’m calling var30.
A var30 uses the first 2 bits of the first byte to determine how many of the trailing bytes are unused. For example, a full 30-bit integer would begin with 0b00, but a small 2-bit integer would begin with 0b11. The other bits are used to store the integer itself. It’s easier to understand with the code:
#include <stdint.h>

uint32_t encode(uint32_t value) {
    // The value must fit in 30 bits; telling the compiler so helps it optimize.
    if (value > ((1 << 30) - 1)) __builtin_unreachable();
    // Leading zeros beyond the guaranteed top two, floored to a multiple of 8.
    // (The | 1 keeps __builtin_clz defined when value is 0.)
    uint8_t bytes = (__builtin_clz(value | 1) - 2) & 0b11000;
    // Shift the value up past the unused trailing bytes and tag the top 2 bits.
    return (value << bytes) | ((uint32_t)bytes << 27);
}
It starts with a hint that the value fits in 30 bits to help the compiler do its work. __builtin_clz(value | 1) - 2 gets the number of leading zeros minus the two that are guaranteed to be empty at the start (due to the value being below 1 << 30); the | 1 just keeps the count defined when the value is 0. The leading zeroes tell us how many “useless” bits there are, and masking with 0b11000 floors that to the closest byte threshold (0/8/16/24). Then the parts of the encoded value are shifted into place and OR’d together.
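To make that concrete, here’s a quick worked example (just a sketch, assuming the encode function above and a compiler that provides __builtin_clz): the 9-bit value 300 needs two payload bytes, so it gets tagged with 0b10 and shifted up by 16.

#include <assert.h>
#include <stdint.h>

int main(void) {
    // 300 = 0b100101100 (9 bits): clz is 23, minus 2 is 21, masked down to 16.
    uint32_t enc = encode(300);
    // Top byte is 0b10000001 - tag 0b10 means 2 trailing bytes are unused.
    assert(enc == 0x812C0000);
    return 0;
}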
uint32_t decode(uint32_t n) {
    // Recover the shift amount (0/8/16/24) from the top 2 bits.
    uint8_t bytes = (n >> 27) & 0b11000;
    // Drop the 2 tag bits, then shift the value back down.
    n &= 0x3FFFFFFF;
    return n >> bytes;
}
Decoding is, as you’d expect, the reverse. You can tell the bytes calculation is the inverse of the bytes masking above - the number is shifted by 27 in the opposite direction, and the 0b11000 mask comes back. Then the top 2 bits are masked off, and the value is shifted back down to its original position.
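And a matching round trip (again a rough sketch, assuming the encode and decode functions above; bytes_needed is a hypothetical helper, since this post doesn’t spell out how the encoded word gets written to the wire):

#include <assert.h>
#include <stdint.h>

// Hypothetical helper: the top 2 bits count the unused trailing bytes,
// so an encoded word needs 4 minus that many bytes on the wire.
static uint32_t bytes_needed(uint32_t encoded) {
    return 4 - (encoded >> 30);
}

int main(void) {
    uint32_t enc = encode(300);
    assert(decode(enc) == 300);     // round-trips back to the original value
    assert(bytes_needed(enc) == 2); // 300 fits in 6 + 8 = 14 value bits
    return 0;
}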
Overall, encode and decode compile down to just a few instructions each, and are pretty easy to understand. The only caveat I can think of is that the first byte can only store 6 bits vs. LEB128’s 7, but I think 2 ** 6 == 64 values in a single byte is high enough for most use cases. Do you have an alternative implementation or general improvement in mind? Let me know on Twitter, or reply on whatever forum you read this on.
Footnotes
1. Ignoring BigInts, which are pretty iffy performance-wise, and overall annoying to work with.