Tag Archives: data compression

Creating an exception in Java is very slow

by Mikhail Vorontsov


29 June 2014 update – now using JMH for testing. Added 2 ways to avoid the cost of exceptions. Made some editorial changes to the original text to reflect JMH usage.

Filling in the stack trace is slow…

Creating an exception in Java is a very slow operation. Expect that throwing an exception will cost you around 1-5 microseconds. Nearly all this time is spent on filling in the exception thread stack. The deeper the stack trace is, the more time it will take to populate it.

Usually we throw an exception in case of unexpected problems. This means that we don’t expect exceptions to be thrown at a rate of thousands per second per process. But sometimes you will encounter a method which uses exceptions for more likely events. We have already seen a good (actually bad) example of such behavior in Base 64 encoding and decoding performance article: sun.misc.BASE64Decoder is extremely slow due to using an exception for “give me more data” requests:

at java.lang.Throwable.fillInStackTrace(Throwable.java:-1)
at java.lang.Throwable.fillInStackTrace(Throwable.java:782)
- locked <0x6c> (a sun.misc.CEStreamExhausted)
at java.lang.Throwable.<init>(Throwable.java:250)
at java.lang.Exception.<init>(Exception.java:54)
at java.io.IOException.<init>(IOException.java:47)
at sun.misc.CEStreamExhausted.<init>(CEStreamExhausted.java:30)
at sun.misc.BASE64Decoder.decodeAtom(BASE64Decoder.java:117)
at sun.misc.CharacterDecoder.decodeBuffer(CharacterDecoder.java:163)
at sun.misc.CharacterDecoder.decodeBuffer(CharacterDecoder.java:194)

You may encounter the same problem if you try to run pack method from String packing part 2: converting Strings to any other objects with a string starting with a digit, but followed by letters. Let’s see how long does it take to pack ‘12345’ and ‘12345a’ with that method:

Benchmark                        (m_param)   Mode   Samples         Mean   Mean error    Units
t.StringPacking2Tests.testPack      12345a  thrpt        10        0.044        0.000   ops/us
t.StringPacking2Tests.testPack       12345  thrpt        10        7.934        0.154   ops/us

As you can see, we were able to convert “12345” from String about 200 times faster than “12345a”. Most of processing time is again being spent filling in stack traces:

at java.lang.Throwable.fillInStackTrace(Throwable.java:-1)
at java.lang.Throwable.fillInStackTrace(Throwable.java:782)
- locked <0x87> (a java.lang.NumberFormatException)
at java.lang.Throwable.<init>(Throwable.java:265)
at java.lang.Exception.<init>(Exception.java:66)
at java.lang.RuntimeException.<init>(RuntimeException.java:62)
at java.lang.IllegalArgumentException.<init>(IllegalArgumentException.java:53)
at java.lang.NumberFormatException.<init>(NumberFormatException.java:55)
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:441)
at java.lang.Long.valueOf(Long.java:540)
at tests.StringPacking2Tests.pack(StringPacking2Tests.java:69)
...

Continue reading

An overview of memory saving techniques in Java

by Mikhail Vorontsov

This article will give you the general advices on memory consumption optimization in Java.

Memory usage optimization is the most important type of optimization in Java. Current systems are limited by memory access times rather than by CPU frequency (otherwise, why CPU producers are implementing all these L1, L2 and L3 caches?). It means that by reducing your application memory footprint you will most likely improve your program data processing speed by making your CPU to wait for smaller amount of data. Now let’s get back to Java.

General Java memory layout information

First of all, we have to revise the memory layout of Java objects: any Java Object occupies at least 16 bytes, 12 out of which are occupied by a Java object header. Besides that, all Java objects are aligned by 8 bytes boundary. It means that, for example, an object containing 2 fields: int and byte will occupy not 17 (12 + 4 + 1), but 24 bytes (17 aligned by 8 bytes).

Each Object reference occupies 4 bytes if the Java heap is under 32G and XX:+UseCompressedOops is turned on (it is turned on by default in the recent Oracle JVM versions). Otherwise, Object references occupy 8 bytes.

All primitive data types occupy their exact byte size:

byte, boolean 1 byte
short, char 2 bytes
int, float 4 bytes
long, double 8 bytes

In essence, this information is sufficient for Java memory optimization. But it will be more convenient if you will be aware of arrays / String / numeric wrappers memory consumption.

Continue reading

Protobuf data encoding for numeric datatypes

by Mikhail Vorontsov

This article will give you a short overview of Google protobuf varint encoding and its applicability to financial data processing.

Google protobuf defines the data serialization format useful for both data storage and RPC calls. It is widely used in Google itself (according to Protobuf documentation). It defines serialization formats for all datatypes ( check this page for more details ), but here is the overview:

Protobuf encoding for most used datatypes

Datatypes Description
Unsigned integral types (int32, int64, bool, enum)

Protobuf relies on the idea that average data contains more small numbers rather than large ones. Varint encoding contains a number of bytes. The most significant bit of each byte (MSB) tells if it is the last byte in the encoding (0) or there are more bytes (1). The first byte in the sequence contains the lowest 7 bits of the original number, the next byte contains the next 7 bits (bits 7-13) and so on. We stop encoding when there are no more set bits (=1) left in the number. For example, 34 (22 hex) will be encoded as 0 010 0010 = 22 hex = 34 in this scheme. On the other hand, 255 (1111 1111) will be encoded in 2 byte sequence: 1111 1111, 0000 0001 – first byte contains 7 lowest bits of an original number and the MSB tells us that there will be one more byte. Second byte contains the bit 7 of an original number. MSB of the second byte is set to zero – no more bytes are expected.

You may notice that this encoding is similar (but not identical) to UTF-8 encoding. It favors the small values over the big values, which is the case of most real world data applications (for example, volume traded in the single trade on the stock exchange will tend do be small).

Signed integral types You can not apply the previous encoding scheme to the fields that can get negative. All negative numbers have the highest bit set, so you will always have to spend the maximal amount of space for negative values encoding. Due to this problem, Protobuf uses a ZigZag encoding, which transforms a signed 32/64 bit number into an unsigned one: 0 is converted into 0, -1 into 1, 1 into 2, -2 -> 3, 2 -> 4 and so on. This scheme still favors small (by their absolute value) numbers by sacrificing just a single bit. A number converted with ZigZag is encoded using VarInt encoding afterwards.
Floating point numbers These numbers are written as fixed length 32/64 bit values using Float.floatToIntBits / Double.doubleToLongBits. Varint encoding is not applicable to these numbers due to their different bit structure (sign-exponent-mantissa).
Strings Strings are encoded as a VarInt encoded string length followed by UTF-8 encoded string contents (I have already written about similar in-memory data compression technique in String packing part 2: converting Strings to any other objects article).

Protobuf defines support for repeated fields as well as nested messages, but these are outside the scope of this article. You can read the official documentation for more details.

Continue reading

Use case: Optimizing memory footprint of a read only csv file (Trove, Unsafe, ByteBuffer, data compression)

by Mikhail Vorontsov

Suppose your application has to obtain some data from an auxiliary data source, which could be a csv file. Your csv file will contain several fields, one of those is used as a field ID. You need to keep all that file in memory and provide fast access to records by an ID field. Your additional target is to consume as little memory as possible, while keeping access cost as low as possible. In this article we will process fake person records. Here is an example:

{id=idnum10, surname=Smith10, address=10 One Way Road, Springfield, NJ, DOB=01/11/1965, names=John Paul 10 Ringo}
    

All these records are generated by the following class:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
private static class DataGenerator
{
    private int cnt = 1;
 
    public Map<String, String> getNextEntry()
    {
        final Map<String, String> res = new HashMap<String, String>( 8 );
        res.put( ID, "idnum" + cnt );
        res.put( "names", "John Paul " + cnt + " Ringo" );
        res.put( "surname", "Smith" + cnt );
        res.put( "DOB", "01/" + two((cnt % 31) + 1) + "/1965" ); //date of birth, MM/DD/YYYY
        res.put( "address", cnt + " One Way Road, Springfield, NJ" );
        ++cnt;
        return res;
    }
 
    private String two( final int val )
    {
        return val < 10 ? "0" + val : Integer.toString( val );
    }
}
private static class DataGenerator
{
    private int cnt = 1;

    public Map<String, String> getNextEntry()
    {
        final Map<String, String> res = new HashMap<String, String>( 8 );
        res.put( ID, "idnum" + cnt );
        res.put( "names", "John Paul " + cnt + " Ringo" );
        res.put( "surname", "Smith" + cnt );
        res.put( "DOB", "01/" + two((cnt % 31) + 1) + "/1965" ); //date of birth, MM/DD/YYYY
        res.put( "address", cnt + " One Way Road, Springfield, NJ" );
        ++cnt;
        return res;
    }

    private String two( final int val )
    {
        return val < 10 ? "0" + val : Integer.toString( val );
    }
}

Simple approach – map of maps

We always have to start from something simple and easy to support. In this case it may be a map of maps: outer map is indexed by ID field and th inner ones are indexed by field names.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
    /**
     * Initial version. Outer map indexed by a key field value, inner map is field name to field value.
     * All field names are interned, but we still have to pay for storing references to field names.
     */
    private static class SimpleMapStorage extends SimpleStorage
    {
        private final Map<String, Map<String, String>> m_data = new HashMap<String, Map<String, String>>( 1000 );
 
        public void addEntry( final Map<String, String> entry )
        {
            m_data.put( entry.get( ID ), entry );
        }
 
        public Map<String, String> getById( final String id )
        {
            return m_data.get( id );
        }
    }
    
    /**
     * Initial version. Outer map indexed by a key field value, inner map is field name to field value.
     * All field names are interned, but we still have to pay for storing references to field names.
     */
    private static class SimpleMapStorage extends SimpleStorage
    {
        private final Map<String, Map<String, String>> m_data = new HashMap<String, Map<String, String>>( 1000 );

        public void addEntry( final Map<String, String> entry )
        {
            m_data.put( entry.get( ID ), entry );
        }

        public Map<String, String> getById( final String id )
        {
            return m_data.get( id );
        }
    }
    

For testing purposes, all storage implementations will either implement Storage interface or extend SimpleStorage class. You will see the purpose of pack method in the more advanced examples.

1
2
3
4
5
6
7
8
9
10
11
12
13
private interface Storage
{
    public void addEntry( final Map<String, String> entry );
    public Map<String, String> getById( final String id );
    public void pack();
}
 
public static abstract class SimpleStorage implements Storage
{
    public void pack()
    {
    }
}
private interface Storage
{
    public void addEntry( final Map<String, String> entry );
    public Map<String, String> getById( final String id );
    public void pack();
}

public static abstract class SimpleStorage implements Storage
{
    public void pack()
    {
    }
}

All storage implementations will be tested by the following method:

1
2
3
4
5
6
7
8
9
10
11
private static void testStorage(final int recordCount, final Storage storage)
{
    final DataGenerator generator = new DataGenerator();
    for ( int i = 0; i < recordCount; ++i )
        storage.addEntry( generator.getNextEntry() );
    storage.pack();
    System.gc();
    final long mem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    System.out.println( storage.getClass().getName() + ": " + mem / 1024.0 / 1024.0 + " MB");
    System.out.println( storage.getById( "idnum10" ) ); //in order to retain storage in memory
}
private static void testStorage(final int recordCount, final Storage storage)
{
    final DataGenerator generator = new DataGenerator();
    for ( int i = 0; i < recordCount; ++i )
        storage.addEntry( generator.getNextEntry() );
    storage.pack();
    System.gc();
    final long mem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    System.out.println( storage.getClass().getName() + ": " + mem / 1024.0 / 1024.0 + " MB");
    System.out.println( storage.getById( "idnum10" ) ); //in order to retain storage in memory
}

For every implementation in this article we will try to create 1 million entries. SimpleMapStorage consumes 706 Mb to store 1M records. The actual data size is about 82 Mb, which means that this simple implementation wastes about 85% of consumed memory. Yes, straightforward solutions for big data storage do not work well in Java…

Continue reading

String packing part 2: converting Strings to any other objects

by Mikhail Vorontsov

As we have found out in the previous string packing article, even an empty String will occupy at least 48 bytes. In some cases we may expect that some String variables will contain values which could be easily converted to primitive values. Unfortunately, other values of these variables may not be converted, so you will still need a String.

You may notice it either for single fields of some objects or you may have some collection of strings, like Map<Integer, String> – it doesn’t matter: you may apply this chapter ideas for both cases.

In order to save memory you need to replace Strings with Objects. You need a conversion method, which will be applied to all Strings you are trying to store and return either an original String or any other less memory-hungry Object. When you will need your original string back, all you need is just to call toString() method on your stored Object (probably, supporting null values as well). All other code must access your fields via such set/get methods.

The easiest types of replacement objects to support are Character for strings with length = 1, Integer for shorter numbers (because it has better caching support than other wrapper types), Long for longer numbers and Double for decimal point numbers.

There are still a few easy to forget pitfalls. First of all, all integer number conversion methods, like Integer.valueOf, support strings starting with zeroes, like “01234”. They will skip leading zeroes and convert the rest of the input string (result will be 1234 in the example). Conversion of 1234 back to string will return “1234”, of course. That’s why we can’t use numbers starting with ‘0’ (except zero itself).

Continue reading

hashCode method performance tuning

by Mikhail Vorontsov

In this chapter we will discuss various implications of hashCode method implementation on application performance.

The main purpose of hashCode method is to allow an object to be a key in the hash map or a member of a hash set. In this case an object should also implement equals(Object) method, which is consistent with hashCode implementation:

  • If a.equals(b) then a.hashCode() == b.hashCode()
  • If hashCode() was called twice on the same object, it should return the same result provided that the object was not changed

hashCode from performance point of view

From the performance point of view, the main objective for your hashCode method implementation is to minimize the number of objects sharing the same hash code. All JDK hash based collections store their values in an array. Hash code is used to calculate an initial lookup position in this array. After that equals is used to compare given value with values stored in the internal array. So, if all values have distinct hash codes, this will minimize the possibility of hash collisions. On the other hand, if all values will have the same hash code, hash map (or set) will degrade into a list with operations on it having O(n2) complexity.

For more details, read about collision resolution in hash maps. JDK is using a method called open addressing, but there is another method called “chaining” – all key-value pairs with the same hash code are stored in a linked list.

Continue reading

String packing part 1: converting characters to bytes

by Mikhail Vorontsov

31 December 2014 update: minor cleanup at the beginning of this article – we are getting further away from Java 6.

From time to time you may need to keep a huge number of strings in memory. Each String contains a character array – actual characters of that String and

Prior to Java 7 update 6 3 int fields – hash code, offset of that string first character in the array and length of the string
Java 7 from update 6 (but before Java 8) 2 int fields – 2 hash codes (normal and murmur)
From Java 8 1 int field – hash code

Let’s see if we can optimize a theoretical string object in terms of memory. It is easy to notice that offset and length are redundant in case when a string was not created in String.substring method (internal char array is not shared with any other String – this is a default behavior since Java 7 update 6). Hash code can also be calculated on each hashCode call instead of caching. Only char array seems to be required in the general case.

Nevertheless, we can avoid using char array. A lot of programs process all its data in only one encoding at a time (or at least, most of data in one encoding). We can convert characters to bytes on the fly into that encoding, so we may keep array of bytes instead of array of chars. For a more general case, UTF-8 may be considered as a safe encoding. Could we do any better? In order to answer this question, we need to review Java object fields memory layout.

32 and 64 bit modes

JVM runs by default using 32 bit object references. JVM will switch to 64 bit object references in one of 2 cases:

  • A JVM is started with over 32G heap
  • -XX:-UseCompressedOops is turned off (you should never do this for production systems)

There are just 2 differences between these modes in terms of memory layout:

  1. Object references occupy 4 bytes in 32 bit mode and 8 bytes in 64 bit mode (object references include references to arrays).
  2. Java object header occupy 12 bytes in 32 bit mode and 16 bytes in 64 bit mode (due to embedded Class reference)

Java object memory layout

All Java objects start with 8 bytes containing service information like object class and its identity hash code (returned by System.identityHashCode method). Arrays have 4 more bytes (one int field) containing array length. Object header is ended with a reference to an object Class (this reference occupy 4 bytes on under 32G heaps and 8 bytes on over 32G heaps). These fields are followed by all declared fields. All objects are aligned by 8 bytes boundary. All primitive fields must be aligned by their size (for example, chars should be aligned by 2 bytes boundary). Object reference (including any arrays) occupy 4 bytes. What does it mean for us? In order to get most use of available memory, all object fields must occupy N*8+4 bytes (4, 12, 20, 28 and so on) in 32 bit mode (a common case we will discuss in this article) or N*8 bytes in 64 bit mode. In this case 100% memory will contain useful data.

Continue reading

Use case: compacting price field disk representation

by Mikhail Vorontsov

In this article I’ll describe a small trick you can use while working with prices (at least, in countries with rather expensive currency).

There are 2 well-known ways to store price in memory and on disk:

  • In smallest currency units of your currency, using integral type of your choice
  • Keep it as double and don’t worry about number of smallest currency units in any currency. This approach has advantage if you are writing a system which supports prices in fractions of smallest currency unit (for example, sub-penny prices on a large number of stock exchanges).

Impact on compression

If you will ever need to compress your data as a stream, when integral types have one crucial advantage: they are shorter and most prices use only a small number of bits from a corresponding datatype. Floating point types, on the other hand, have sign, exponent and mantissa at various places of a datatype binary representation (see IEEE-754 standard for details), which generally makes data compression less effective.

Continue reading

Use case: how to compact a long-to-long mapping

by Mikhail Vorontsov

In this post we will discuss what techniques could be applied to an integer-to-integer map in order to further reduce memory consumption. Trove collections will be used in this post.

Suppose we have 2 similar message streams. Each stream contains “main” messages, which define new identifiers, and “child” messages, which refer to earlier defined identifiers. Identifiers in both streams are positive long values starting from 1, but there may be some gaps in both identifier sequences.

We need to merge two such streams into one stream, keeping references between “main” and “child” messages. We are allowed to change any identifiers as long as references will not be broken.

Here is an example of how it should work. First id you see is 1 from stream A – you will map it to 1. Next comes 1 from stream B – you will map it to 2. The next one is 4 from stream B (two ids are missing in stream B) – you will map it to 3. It is followed by 2 from stream A – it will be mapped to 4. Now you’ll get 1 from stream A again – you should use previously assigned 1 in this case. It is followed by 2 from stream A – you should use previously assigned 4. And so on…

Continue reading