Tag Archives: memory optimization

Protobuf data encoding for numeric datatypes

by Mikhail Vorontsov

This article will give you a short overview of Google protobuf varint encoding and its applicability to financial data processing.

Google protobuf defines the data serialization format useful for both data storage and RPC calls. It is widely used in Google itself (according to Protobuf documentation). It defines serialization formats for all datatypes ( check this page for more details ), but here is the overview:

Protobuf encoding for most used datatypes

Datatypes Description
Unsigned integral types (int32, int64, bool, enum)

Protobuf relies on the idea that average data contains more small numbers rather than large ones. Varint encoding contains a number of bytes. The most significant bit of each byte (MSB) tells if it is the last byte in the encoding (0) or there are more bytes (1). The first byte in the sequence contains the lowest 7 bits of the original number, the next byte contains the next 7 bits (bits 7-13) and so on. We stop encoding when there are no more set bits (=1) left in the number. For example, 34 (22 hex) will be encoded as 0 010 0010 = 22 hex = 34 in this scheme. On the other hand, 255 (1111 1111) will be encoded in 2 byte sequence: 1111 1111, 0000 0001 – first byte contains 7 lowest bits of an original number and the MSB tells us that there will be one more byte. Second byte contains the bit 7 of an original number. MSB of the second byte is set to zero – no more bytes are expected.

You may notice that this encoding is similar (but not identical) to UTF-8 encoding. It favors the small values over the big values, which is the case of most real world data applications (for example, volume traded in the single trade on the stock exchange will tend do be small).

Signed integral types You can not apply the previous encoding scheme to the fields that can get negative. All negative numbers have the highest bit set, so you will always have to spend the maximal amount of space for negative values encoding. Due to this problem, Protobuf uses a ZigZag encoding, which transforms a signed 32/64 bit number into an unsigned one: 0 is converted into 0, -1 into 1, 1 into 2, -2 -> 3, 2 -> 4 and so on. This scheme still favors small (by their absolute value) numbers by sacrificing just a single bit. A number converted with ZigZag is encoded using VarInt encoding afterwards.
Floating point numbers These numbers are written as fixed length 32/64 bit values using Float.floatToIntBits / Double.doubleToLongBits. Varint encoding is not applicable to these numbers due to their different bit structure (sign-exponent-mantissa).
Strings Strings are encoded as a VarInt encoded string length followed by UTF-8 encoded string contents (I have already written about similar in-memory data compression technique in String packing part 2: converting Strings to any other objects article).

Protobuf defines support for repeated fields as well as nested messages, but these are outside the scope of this article. You can read the official documentation for more details.

Continue reading

Use case: Optimizing memory footprint of a read only csv file (Trove, Unsafe, ByteBuffer, data compression)

by Mikhail Vorontsov

Suppose your application has to obtain some data from an auxiliary data source, which could be a csv file. Your csv file will contain several fields, one of those is used as a field ID. You need to keep all that file in memory and provide fast access to records by an ID field. Your additional target is to consume as little memory as possible, while keeping access cost as low as possible. In this article we will process fake person records. Here is an example:

{id=idnum10, surname=Smith10, address=10 One Way Road, Springfield, NJ, DOB=01/11/1965, names=John Paul 10 Ringo}
    

All these records are generated by the following class:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
private static class DataGenerator
{
    private int cnt = 1;
 
    public Map<String, String> getNextEntry()
    {
        final Map<String, String> res = new HashMap<String, String>( 8 );
        res.put( ID, "idnum" + cnt );
        res.put( "names", "John Paul " + cnt + " Ringo" );
        res.put( "surname", "Smith" + cnt );
        res.put( "DOB", "01/" + two((cnt % 31) + 1) + "/1965" ); //date of birth, MM/DD/YYYY
        res.put( "address", cnt + " One Way Road, Springfield, NJ" );
        ++cnt;
        return res;
    }
 
    private String two( final int val )
    {
        return val < 10 ? "0" + val : Integer.toString( val );
    }
}
private static class DataGenerator
{
    private int cnt = 1;

    public Map<String, String> getNextEntry()
    {
        final Map<String, String> res = new HashMap<String, String>( 8 );
        res.put( ID, "idnum" + cnt );
        res.put( "names", "John Paul " + cnt + " Ringo" );
        res.put( "surname", "Smith" + cnt );
        res.put( "DOB", "01/" + two((cnt % 31) + 1) + "/1965" ); //date of birth, MM/DD/YYYY
        res.put( "address", cnt + " One Way Road, Springfield, NJ" );
        ++cnt;
        return res;
    }

    private String two( final int val )
    {
        return val < 10 ? "0" + val : Integer.toString( val );
    }
}

Simple approach – map of maps

We always have to start from something simple and easy to support. In this case it may be a map of maps: outer map is indexed by ID field and th inner ones are indexed by field names.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
    /**
     * Initial version. Outer map indexed by a key field value, inner map is field name to field value.
     * All field names are interned, but we still have to pay for storing references to field names.
     */
    private static class SimpleMapStorage extends SimpleStorage
    {
        private final Map<String, Map<String, String>> m_data = new HashMap<String, Map<String, String>>( 1000 );
 
        public void addEntry( final Map<String, String> entry )
        {
            m_data.put( entry.get( ID ), entry );
        }
 
        public Map<String, String> getById( final String id )
        {
            return m_data.get( id );
        }
    }
    
    /**
     * Initial version. Outer map indexed by a key field value, inner map is field name to field value.
     * All field names are interned, but we still have to pay for storing references to field names.
     */
    private static class SimpleMapStorage extends SimpleStorage
    {
        private final Map<String, Map<String, String>> m_data = new HashMap<String, Map<String, String>>( 1000 );

        public void addEntry( final Map<String, String> entry )
        {
            m_data.put( entry.get( ID ), entry );
        }

        public Map<String, String> getById( final String id )
        {
            return m_data.get( id );
        }
    }
    

For testing purposes, all storage implementations will either implement Storage interface or extend SimpleStorage class. You will see the purpose of pack method in the more advanced examples.

1
2
3
4
5
6
7
8
9
10
11
12
13
private interface Storage
{
    public void addEntry( final Map<String, String> entry );
    public Map<String, String> getById( final String id );
    public void pack();
}
 
public static abstract class SimpleStorage implements Storage
{
    public void pack()
    {
    }
}
private interface Storage
{
    public void addEntry( final Map<String, String> entry );
    public Map<String, String> getById( final String id );
    public void pack();
}

public static abstract class SimpleStorage implements Storage
{
    public void pack()
    {
    }
}

All storage implementations will be tested by the following method:

1
2
3
4
5
6
7
8
9
10
11
private static void testStorage(final int recordCount, final Storage storage)
{
    final DataGenerator generator = new DataGenerator();
    for ( int i = 0; i < recordCount; ++i )
        storage.addEntry( generator.getNextEntry() );
    storage.pack();
    System.gc();
    final long mem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    System.out.println( storage.getClass().getName() + ": " + mem / 1024.0 / 1024.0 + " MB");
    System.out.println( storage.getById( "idnum10" ) ); //in order to retain storage in memory
}
private static void testStorage(final int recordCount, final Storage storage)
{
    final DataGenerator generator = new DataGenerator();
    for ( int i = 0; i < recordCount; ++i )
        storage.addEntry( generator.getNextEntry() );
    storage.pack();
    System.gc();
    final long mem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    System.out.println( storage.getClass().getName() + ": " + mem / 1024.0 / 1024.0 + " MB");
    System.out.println( storage.getById( "idnum10" ) ); //in order to retain storage in memory
}

For every implementation in this article we will try to create 1 million entries. SimpleMapStorage consumes 706 Mb to store 1M records. The actual data size is about 82 Mb, which means that this simple implementation wastes about 85% of consumed memory. Yes, straightforward solutions for big data storage do not work well in Java…

Continue reading

Memory introspection using sun.misc.Unsafe and reflection

by Mikhail Vorontsov

It is useful for a serious Java developer to realize how much memory is occupied by one or another Java object. You may have heard that we live in a world there memory is not an issue anymore. This may be true for your text editor ( though, try to open a large document with tons of embedded images and charts and see how much memory will be consumed by your favourite editor ). This may be true for a dedicated server software (at least, until your business would grow to a bigger market or you will run another piece of software on the same server). This is may even be true for a cloud-based software, if you are rich enough in order to pay a premium for a top-class server hardware.

Still, in the real world your software will once reach a point where it makes sense to spend money in its optimization rather than trying to obtain an even better hardware (currently the most you can get on a commodity class server is 64G RAM). At this point you will have to analyze your application and find which data structures consume most of application memory. The best tool for such task is a good profiler, but you can start with a cheaper approach of analyzing your objects inside your code. This article describes an Oracle JDK based ClassIntrospector class, which will allow you to analyze your application memory consumption.

I have already mentioned the Java object memory layout in the String packing part 1: converting characters to bytes article. For example, I have written that a 28 character long String should occupy 104 bytes before Java 1.7.0_06. To be honest, at the time of writing of that article I used my profiler to get a proof for my calculations. Now it is about time to implement a Java object introspector using pure Java and Oracle JDK specific sun.misc.Unsafe class.

We will use the following sun.misc.Unsafe methods:

Continue reading

String packing part 2: converting Strings to any other objects

by Mikhail Vorontsov

As we have found out in the previous string packing article, even an empty String will occupy at least 48 bytes. In some cases we may expect that some String variables will contain values which could be easily converted to primitive values. Unfortunately, other values of these variables may not be converted, so you will still need a String.

You may notice it either for single fields of some objects or you may have some collection of strings, like Map<Integer, String> – it doesn’t matter: you may apply this chapter ideas for both cases.

In order to save memory you need to replace Strings with Objects. You need a conversion method, which will be applied to all Strings you are trying to store and return either an original String or any other less memory-hungry Object. When you will need your original string back, all you need is just to call toString() method on your stored Object (probably, supporting null values as well). All other code must access your fields via such set/get methods.

The easiest types of replacement objects to support are Character for strings with length = 1, Integer for shorter numbers (because it has better caching support than other wrapper types), Long for longer numbers and Double for decimal point numbers.

There are still a few easy to forget pitfalls. First of all, all integer number conversion methods, like Integer.valueOf, support strings starting with zeroes, like “01234”. They will skip leading zeroes and convert the rest of the input string (result will be 1234 in the example). Conversion of 1234 back to string will return “1234”, of course. That’s why we can’t use numbers starting with ‘0’ (except zero itself).

Continue reading

Changes to String internal representation made in Java 1.7.0_06

by Mikhail Vorontsov

This post was updated on 19 Nov 2013 in order to reflect Java 8 changes.


This post was updated on 28 Nov 2013 in order to reflect Java 7u40 changes (thanks to Sunny Chan and his colleagues for pointing my attention at this JDK update).

Sharing an underlying char[]

An original String implementation has 4 non static field: char[] value with string characters, int offset and int count with an index of the first character to use from value and a number of characters to use and int hash with a cached value of a String hash code. As you can see, in a very large number of cases a String will have offset = 0 and count = value.length. The only exception to this rule were the strings created via String.substring calls and all API calls using this method internally (like Pattern.split).

String.substring created a String, which shared an internal char[] value with an original String, which allowed you:

  1. To save some memory by sharing character data
  2. To run String.substring in a constant time ( O(1) )

At the same time such feature was a source of a possible memory leak: if you extract a tiny substring from an original huge string and discard that original string, you will still have a live reference to the underlying huge char[] value taken from an original String. The only way to avoid it was to call a new String( String ) constructor on such string – it made a copy of a required section of underlying char[], thus unlinking your shorter string from its longer “parent”.

From Java 1.7.0_06 (as well as in current versions of Java 8 – Nov 13) offset and count fields were removed from a String. This means that you can’t share a part of an underlying char[] value anymore. Now you can forget about a memory leak described above and never ever use new String(String) constructor anymore. As a drawback, you now have to remember that String.substring has now a linear complexity instead of a constant one.

Continue reading

hashCode method performance tuning

by Mikhail Vorontsov

In this chapter we will discuss various implications of hashCode method implementation on application performance.

The main purpose of hashCode method is to allow an object to be a key in the hash map or a member of a hash set. In this case an object should also implement equals(Object) method, which is consistent with hashCode implementation:

  • If a.equals(b) then a.hashCode() == b.hashCode()
  • If hashCode() was called twice on the same object, it should return the same result provided that the object was not changed

hashCode from performance point of view

From the performance point of view, the main objective for your hashCode method implementation is to minimize the number of objects sharing the same hash code. All JDK hash based collections store their values in an array. Hash code is used to calculate an initial lookup position in this array. After that equals is used to compare given value with values stored in the internal array. So, if all values have distinct hash codes, this will minimize the possibility of hash collisions. On the other hand, if all values will have the same hash code, hash map (or set) will degrade into a list with operations on it having O(n2) complexity.

For more details, read about collision resolution in hash maps. JDK is using a method called open addressing, but there is another method called “chaining” – all key-value pairs with the same hash code are stored in a linked list.

Continue reading

String packing part 1: converting characters to bytes

by Mikhail Vorontsov

31 December 2014 update: minor cleanup at the beginning of this article – we are getting further away from Java 6.

From time to time you may need to keep a huge number of strings in memory. Each String contains a character array – actual characters of that String and

Prior to Java 7 update 6 3 int fields – hash code, offset of that string first character in the array and length of the string
Java 7 from update 6 (but before Java 8) 2 int fields – 2 hash codes (normal and murmur)
From Java 8 1 int field – hash code

Let’s see if we can optimize a theoretical string object in terms of memory. It is easy to notice that offset and length are redundant in case when a string was not created in String.substring method (internal char array is not shared with any other String – this is a default behavior since Java 7 update 6). Hash code can also be calculated on each hashCode call instead of caching. Only char array seems to be required in the general case.

Nevertheless, we can avoid using char array. A lot of programs process all its data in only one encoding at a time (or at least, most of data in one encoding). We can convert characters to bytes on the fly into that encoding, so we may keep array of bytes instead of array of chars. For a more general case, UTF-8 may be considered as a safe encoding. Could we do any better? In order to answer this question, we need to review Java object fields memory layout.

32 and 64 bit modes

JVM runs by default using 32 bit object references. JVM will switch to 64 bit object references in one of 2 cases:

  • A JVM is started with over 32G heap
  • -XX:-UseCompressedOops is turned off (you should never do this for production systems)

There are just 2 differences between these modes in terms of memory layout:

  1. Object references occupy 4 bytes in 32 bit mode and 8 bytes in 64 bit mode (object references include references to arrays).
  2. Java object header occupy 12 bytes in 32 bit mode and 16 bytes in 64 bit mode (due to embedded Class reference)

Java object memory layout

All Java objects start with 8 bytes containing service information like object class and its identity hash code (returned by System.identityHashCode method). Arrays have 4 more bytes (one int field) containing array length. Object header is ended with a reference to an object Class (this reference occupy 4 bytes on under 32G heaps and 8 bytes on over 32G heaps). These fields are followed by all declared fields. All objects are aligned by 8 bytes boundary. All primitive fields must be aligned by their size (for example, chars should be aligned by 2 bytes boundary). Object reference (including any arrays) occupy 4 bytes. What does it mean for us? In order to get most use of available memory, all object fields must occupy N*8+4 bytes (4, 12, 20, 28 and so on) in 32 bit mode (a common case we will discuss in this article) or N*8 bytes in 64 bit mode. In this case 100% memory will contain useful data.

Continue reading

java.util.ArrayList performance guide

by Mikhail Vorontsov

We will discuss most of possible ArrayList performance problems in this article. ArrayList methods will be divided into several groups and their performance will be discussed.

ArrayList is a general list implementation suitable for most use cases. It is backed by an Object array, which size is dynamically adjusted while user adds or removes elements from the list. If you need a resizable list of primitive type values, you should read a Trove library article.

Adding elements to the list

There are 2 pairs of methods used to add elements:

  • add(E)
  • add(int, E)
  • addAll(List<E>)
  • addAll(int, List<E>)

Single argument methods are adding new elements to the list tail. They are the fastest ones. Methods specifying insertion position have to copy all array elements to the right from insertion point by System.arraycopy call, that’s why these methods have O(n) complexity (performance penalty is small if new element is added near the tail, but getting larger when insertion point moves towards the head). This means that you shouldn’t add too many elements to the head of big ArrayList. Adding 1M elements to the head of List<String> took 125 seconds on my computer. Adding 0.5M elements took 30 seconds, which proves that n calls of a method with O(n) complexity have O(n2) complexity (adding 2 times more elements took 4 times longer).

If there is not enough space in the array to fit new elements, it should be extended. Extension logic was changed in Java 7: previously new array size was oldSize * 3 / 2 + 1, but it is oldSize * 3 in Java 7.

Continue reading

java.util.Date, java.util.Calendar and java.text.SimpleDateFormat performance

by Mikhail Vorontsov

In this post we will discuss memory footprint and speed of various classes used to represent and parse dates and times in core Java:

  • java.util.Date – Java 1.0 style datetime storage class, now absolutely useless
  • java.util.Calendar – Java 1.1+ style datetime storage class, supports time zones and locales, useful for date calculations
  • java.text.SimpleDateFormat – most commonly used datetime parser

Datetime storage classes

There are 2 core Java classes designed to store dates and times: java.util.Date and java.util.Calendar. Despite the fact the former is used in a lot of APIs, it is actually just an wrapper over a long timestamp field. It is also important to note that java.util.Date is not immutable, what further limits its use.

The latter one, java.util.Calendar, was added in Java 1.1 in order to support internationalization. Besides, it has a good support of datetime arithmetic operations, like adding 30 days to a given date, supporting all daylight savings issues as well. So, this class should be used for most of modern software datetime operations.

Still, when it comes to storing millions of datetimes, both these classes are a rather bad choice. Simply remember the footprint of instances of these classes in most common conditions:

java.util.Date java.util.GregorianCalendar
24 bytes 448 bytes

Yes, 448 bytes to keep a timestamp, locale and time zone. Take a look at your IDE debugger to see how much information is stored inside a java.util.GregorianCalendar – the most commonly used subclass of java.util.Calendar.

Continue reading

java.lang.Byte, Short, Integer, Long, Character (boxing and unboxing)

by Mikhail Vorontsov

In this article we will discuss how boxing and unboxing is implemented in Java 1.5+ and what implications has such implementation. There are no differences between Java 6 and 7 implementations.

Boxing is a process of conversion of primitive type variable into java.lang.Number subclass (Java.lang.Byte, Short, Integer, Long, Float, Double). Boxing is done via valueOf method. For example, for Integer it is:

1
static Integer valueOf( int i )
static Integer valueOf( int i )

Unboxing can be done from any of java.lang.Number subclasses to any primitive type. This is done via set of methods defined in java.lang.Number:

1
2
3
4
5
6
7
8
9
10
public abstract int intValue();
public abstract long longValue();
public abstract float floatValue();
public abstract double doubleValue();
public byte byteValue() {
    return (byte)intValue();
}
public short shortValue() {
    return (short)intValue();
}
public abstract int intValue();
public abstract long longValue();
public abstract float floatValue();
public abstract double doubleValue();
public byte byteValue() {
    return (byte)intValue();
}
public short shortValue() {
    return (short)intValue();
}

Implications on performance

There is a common mistake done while converting String to a primitive value type. Which method is better to use: parse*(String) or valueOf(String)? The answer is parse* (parseInt, parseLong and so on). It returns a primitive value. Remember that all valueOf methods return an Object (though it may be cached sometimes – see below).

Continue reading