Protobuf data encoding for numeric datatypes

by Mikhail Vorontsov

This article will give you a short overview of Google protobuf varint encoding and its applicability to financial data processing.

Google protobuf defines a data serialization format useful for both data storage and RPC calls. According to the Protobuf documentation, it is widely used inside Google itself. It defines serialization formats for all datatypes (see the official documentation for more details), but here is an overview:

Protobuf encoding for the most used datatypes

Unsigned integral types (int32, int64, bool, enum)

Protobuf relies on the observation that average data contains far more small numbers than large ones. A varint encoding consists of a sequence of bytes. The most significant bit (MSB) of each byte tells whether this is the last byte of the encoding (0) or more bytes follow (1). The first byte of the sequence contains the lowest 7 bits of the original number, the next byte contains the next 7 bits (bits 7-13), and so on; encoding stops once no set bits (=1) are left in the number. For example, 34 (hex 22) is encoded as the single byte 0010 0010: the MSB is 0 and the remaining 7 bits hold the value 34. On the other hand, 255 (1111 1111) is encoded as a 2-byte sequence: 1111 1111, 0000 0001. The first byte contains the 7 lowest bits of the original number, and its MSB tells us that one more byte follows; the second byte contains bit 7 of the original number, and its MSB is set to zero, so no more bytes are expected.

You may notice that this encoding is similar (but not identical) to UTF-8 encoding. It favors small values over big ones, which matches most real world data (for example, the volume traded in a single trade on a stock exchange tends to be small).
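
As an illustration, here is a minimal sketch of the scheme just described. This is my own helper, not the actual Protobuf implementation, although CodedOutputStream works along the same lines:

/** Minimal varint writer: encodes value into dst starting at pos and returns the new position. */
static int writeVarint( long value, final byte[] dst, int pos ) {
    while ( ( value & ~0x7FL ) != 0 ) {                      // more than 7 significant bits remain
        dst[ pos++ ] = (byte) ( ( value & 0x7F ) | 0x80 );   // MSB set: more bytes follow
        value >>>= 7;
    }
    dst[ pos++ ] = (byte) value;                             // final byte, MSB is clear
    return pos;
}

Encoding 255 with this helper produces exactly the two bytes from the example above: 0xFF, 0x01.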

Signed integral types (sint32, sint64)

You cannot apply the previous encoding scheme to fields that may hold negative values: all negative numbers have the highest bit set, so you would always have to spend the maximal amount of space on them. To deal with this problem, Protobuf uses ZigZag encoding, which transforms a signed 32/64 bit number into an unsigned one: 0 is converted into 0, -1 into 1, 1 into 2, -2 into 3, 2 into 4 and so on. This scheme still favors numbers that are small by absolute value at the cost of just a single bit. A ZigZag-converted number is then encoded as a varint.
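
The ZigZag transformation itself is a pair of one-line bit tricks. Here is a sketch for 64-bit values, matching the formulas given in the Protobuf documentation:

/** ZigZag encoding for 64-bit values: maps 0, -1, 1, -2, 2 ... to 0, 1, 2, 3, 4 ... */
static long zigzagEncode( final long n ) {
    return ( n << 1 ) ^ ( n >> 63 );  // arithmetic shift replicates the sign bit
}

/** The inverse transformation. */
static long zigzagDecode( final long n ) {
    return ( n >>> 1 ) ^ -( n & 1 );  // restore the sign from the lowest bit
}
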
Floating point numbers

These numbers are written as fixed-length 32/64-bit values using Float.floatToIntBits / Double.doubleToLongBits. Varint encoding is not applicable to these numbers due to their different bit structure (sign-exponent-mantissa).
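
To make the fixed-width encoding concrete, here is a tiny runnable example (the class name is mine). It prints the 8 bytes of a double in little-endian order, which is the order Protobuf uses:

public class DoubleBits {
    public static void main( String[] args ) {
        final double value = 42.5;
        final long bits = Double.doubleToLongBits( value ); // raw IEEE 754 sign-exponent-mantissa
        for ( int shift = 0; shift < 64; shift += 8 )       // lowest byte first = little-endian
            System.out.printf( "%02X ", (int) ( bits >>> shift ) & 0xFF );
    }
}
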
Strings

Strings are encoded as a varint-encoded string length followed by the UTF-8 encoded string contents (I have already written about a similar in-memory data compression technique in the String packing part 2: converting Strings to any other objects article).
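
A sketch of this wire format, reusing the writeVarint helper from above (again, my own helper rather than the Protobuf API):

/** Writes a varint length prefix followed by the UTF-8 bytes; returns the new position. */
static int writeString( final String s, final byte[] dst, int pos ) {
    final byte[] utf8 = s.getBytes( java.nio.charset.StandardCharsets.UTF_8 );
    pos = writeVarint( utf8.length, dst, pos );         // length prefix as a varint
    System.arraycopy( utf8, 0, dst, pos, utf8.length ); // raw UTF-8 contents
    return pos + utf8.length;
}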

Protobuf defines support for repeated fields as well as nested messages, but these are outside the scope of this article. You can read the official documentation for more details.

Use case: storing integer value as long/double in the persistent storage

The following part of the article describes my observations of how various systems (legacy and not so much) store integer values in long/double fields. Floating point fields are sometimes used for such purposes due to mostly historical reasons (I will not go deeper into those reasons here).

Let’s assume that we want to save 1 million integer values in a gzipped file. We gzip the file in order to find out how much information we actually have in it from a generic compression tool’s point of view. For this example we will call the Random.nextInt( int max ) method in order to get a random value between 0 and max - 1. This is a little different from real life data, but this model can be treated as a “worst average” case.

I will write data using DataOutputStream in the first set of tests and Protobuf CodedOutputStream helper methods in the second set. Note that I will not write field numbers or field types as a normal Protobuf implementation does (the write*NoTag methods will be used instead).
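
The test harness looked roughly like the following sketch. The class and file names are mine, and only the long variants are shown; the double variants are analogous, using writeDouble and writeDoubleNoTag:

import com.google.protobuf.CodedOutputStream;
import java.io.*;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class GzipSizeTest {
    public static void main( String[] args ) throws IOException {
        final int count = 1000000;
        final int max = 10000;
        final Random r = new Random( 12345 );

        // long values via DataOutputStream: big-endian, fixed 8 bytes per value
        try ( DataOutputStream dos = new DataOutputStream(
                new GZIPOutputStream( new FileOutputStream( "long_plain.gz" ) ) ) ) {
            for ( int i = 0; i < count; ++i )
                dos.writeLong( r.nextInt( max ) );
        }

        // long values via protobuf varint encoding, no field tags
        r.setSeed( 12345 ); // replay the same sequence of values
        try ( OutputStream out = new GZIPOutputStream( new FileOutputStream( "long_protobuf.gz" ) ) ) {
            final CodedOutputStream cos = CodedOutputStream.newInstance( out );
            for ( int i = 0; i < count; ++i )
                cos.writeInt64NoTag( r.nextInt( max ) );
            cos.flush();
        }
    }
}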

We will use several maximal values for this test in order to see how the increased number of bits in the generated data affects the output size. The second factor to account for is the growth in the number of distinct values, which makes it harder for the compression tool to find repeating sequences.

Max value              100      1 000    10 000   100 000
long, no protobuf      1366K    1891K    2407K    2899K
long, protobuf         820K     1465K    1869K    2377K
double, no protobuf    1511K    1948K    2563K    3133K
double, protobuf       1403K    1915K    2563K    3053K

There are two independent observations here:

  • If your integer numbers are not completely random, in many cases it is worth applying VarInt encoding to them. The space savings grow as more of your values become relatively small. We will check VarInt performance below.
  • Results for double encoding differ despite the fact that we always write 8 bytes per value with both CodedOutputStream and DataOutputStream. The difference is in the byte order: it is big-endian for DataOutputStream and little-endian for CodedOutputStream (both are fixed; if you want a choice of byte order, use a ByteBuffer).

In order to understand the difference between big and little endian byte orders, take a look at the little endian order in CodedOutputStream.writeRawLittleEndian64 and the big endian order in DataOutputStream.writeLong:

/** CodedOutputStream: Write a little-endian 64-bit integer. */
public void writeRawLittleEndian64(final long value) throws IOException {
    writeRawByte((int)(value      ) & 0xFF);
    writeRawByte((int)(value >>  8) & 0xFF);
    writeRawByte((int)(value >> 16) & 0xFF);
    writeRawByte((int)(value >> 24) & 0xFF);
    writeRawByte((int)(value >> 32) & 0xFF);
    writeRawByte((int)(value >> 40) & 0xFF);
    writeRawByte((int)(value >> 48) & 0xFF);
    writeRawByte((int)(value >> 56) & 0xFF);
}
 
/** DataOutputStream: Write a big-endian 64-bit integer. */
public final void writeLong(long v) throws IOException {
    writeBuffer[0] = (byte)(v >>> 56);
    writeBuffer[1] = (byte)(v >>> 48);
    writeBuffer[2] = (byte)(v >>> 40);
    writeBuffer[3] = (byte)(v >>> 32);
    writeBuffer[4] = (byte)(v >>> 24);
    writeBuffer[5] = (byte)(v >>> 16);
    writeBuffer[6] = (byte)(v >>>  8);
    writeBuffer[7] = (byte)(v >>>  0);
    out.write(writeBuffer, 0, 8);
    incCount(8);
}

VarInt encoding performance

Let’s see how fast the varint encoding algorithm is for long values. We will encode all long values in the range [0; 50M) into a byte[]. Writing into a byte[] allows us to avoid any OutputStream overhead.

public static void main( String[] args ) throws IOException {
    testEncodeSpeed( 20000 );
    testEncodeSpeed( 50000000 );
}
 
private static void testEncodeSpeed( final int count ) throws IOException {
    final byte[] buffer = new byte[ count * 8 ];
    final CodedOutputStream os = CodedOutputStream.newInstance( buffer );
    final long start = System.currentTimeMillis();
    for ( long i = 0; i < count; ++i )
        os.writeInt64NoTag( i );
    final long time = System.currentTimeMillis() - start;
    System.out.println( "Time to encode " + count + " values = " + ( time / 1000.0 ) + " sec" );
    System.out.println( "Space consumed = " + ( buffer.length - os.spaceLeft() ) / 1024.0 + " K" );
}

My slow Core i5-3317 (1.7 GHz) managed to encode 50 million long values in 1.4 sec, which is approximately 285 MB/sec in terms of the raw 8-byte input. The output for 50 million long values is about 193 MB, which gives us a twofold reduction in data size at extremely low CPU cost.

Recommended reading

If you are interested in other bit-processing magic tricks, take a look at the new edition of "Hacker's Delight" by Henry S. Warren.

Summary

  • Always try to switch to integer datatypes if you have found that some double/float fields in your serializable collections contain only integer values. This will increase the compression ratio of general purpose algorithms on your binary data.
  • Try to use Google protobuf or any other similar encoding for your integer data if a noticeable part of your values happens to be small (non-negative and small, or small by absolute value if ZigZag is used). You will get a noticeable data size reduction at a very low CPU cost, which will help you store and later read more messages per unit of time.