Tag Archives: google

Protobuf data encoding for numeric datatypes

by Mikhail Vorontsov

This article will give you a short overview of Google protobuf varint encoding and its applicability to financial data processing.

Google protobuf defines the data serialization format useful for both data storage and RPC calls. It is widely used in Google itself (according to Protobuf documentation). It defines serialization formats for all datatypes ( check this page for more details ), but here is the overview:

Protobuf encoding for most used datatypes

Datatypes Description
Unsigned integral types (int32, int64, bool, enum)

Protobuf relies on the idea that average data contains more small numbers rather than large ones. Varint encoding contains a number of bytes. The most significant bit of each byte (MSB) tells if it is the last byte in the encoding (0) or there are more bytes (1). The first byte in the sequence contains the lowest 7 bits of the original number, the next byte contains the next 7 bits (bits 7-13) and so on. We stop encoding when there are no more set bits (=1) left in the number. For example, 34 (22 hex) will be encoded as 0 010 0010 = 22 hex = 34 in this scheme. On the other hand, 255 (1111 1111) will be encoded in 2 byte sequence: 1111 1111, 0000 0001 – first byte contains 7 lowest bits of an original number and the MSB tells us that there will be one more byte. Second byte contains the bit 7 of an original number. MSB of the second byte is set to zero – no more bytes are expected.

You may notice that this encoding is similar (but not identical) to UTF-8 encoding. It favors the small values over the big values, which is the case of most real world data applications (for example, volume traded in the single trade on the stock exchange will tend do be small).

Signed integral types You can not apply the previous encoding scheme to the fields that can get negative. All negative numbers have the highest bit set, so you will always have to spend the maximal amount of space for negative values encoding. Due to this problem, Protobuf uses a ZigZag encoding, which transforms a signed 32/64 bit number into an unsigned one: 0 is converted into 0, -1 into 1, 1 into 2, -2 -> 3, 2 -> 4 and so on. This scheme still favors small (by their absolute value) numbers by sacrificing just a single bit. A number converted with ZigZag is encoded using VarInt encoding afterwards.
Floating point numbers These numbers are written as fixed length 32/64 bit values using Float.floatToIntBits / Double.doubleToLongBits. Varint encoding is not applicable to these numbers due to their different bit structure (sign-exponent-mantissa).
Strings Strings are encoded as a VarInt encoded string length followed by UTF-8 encoded string contents (I have already written about similar in-memory data compression technique in String packing part 2: converting Strings to any other objects article).

Protobuf defines support for repeated fields as well as nested messages, but these are outside the scope of this article. You can read the official documentation for more details.

Continue reading