Use case: Optimizing memory footprint of a read only csv file (Trove, Unsafe, ByteBuffer, data compression)

by Mikhail Vorontsov

Suppose your application has to load some data from an auxiliary data source, such as a csv file. The file contains several fields, one of which serves as a record ID. You need to keep the whole file in memory and provide fast access to records by the ID field. An additional goal is to consume as little memory as possible, while keeping access cost as low as possible. In this article we will process fake person records. Here is an example:

{id=idnum10, surname=Smith10, address=10 One Way Road, Springfield, NJ, DOB=01/11/1965, names=John Paul 10 Ringo}
    

All these records are generated by the following class:

private static class DataGenerator
{
    private int cnt = 1;
 
    public Map<String, String> getNextEntry()
    {
        final Map<String, String> res = new HashMap<String, String>( 8 );
        res.put( ID, "idnum" + cnt );
        res.put( "names", "John Paul " + cnt + " Ringo" );
        res.put( "surname", "Smith" + cnt );
        res.put( "DOB", "01/" + two((cnt % 31) + 1) + "/1965" ); //date of birth, MM/DD/YYYY
        res.put( "address", cnt + " One Way Road, Springfield, NJ" );
        ++cnt;
        return res;
    }
 
    private String two( final int val )
    {
        return val < 10 ? "0" + val : Integer.toString( val );
    }
}

Simple approach – map of maps

We always have to start from something simple and easy to support. In this case it may be a map of maps: the outer map is indexed by the ID field and the inner ones are indexed by field names.

    /**
     * Initial version. Outer map indexed by a key field value, inner map is field name to field value.
     * All field names are interned, but we still have to pay for storing references to field names.
     */
    private static class SimpleMapStorage extends SimpleStorage
    {
        private final Map<String, Map<String, String>> m_data = new HashMap<String, Map<String, String>>( 1000 );
 
        public void addEntry( final Map<String, String> entry )
        {
            m_data.put( entry.get( ID ), entry );
        }
 
        public Map<String, String> getById( final String id )
        {
            return m_data.get( id );
        }
    }
    

For testing purposes, all storage implementations will either implement the Storage interface or extend the SimpleStorage class. You will see the purpose of the pack method in the more advanced examples.

private interface Storage
{
    public void addEntry( final Map<String, String> entry );
    public Map<String, String> getById( final String id );
    public void pack();
}
 
public static abstract class SimpleStorage implements Storage
{
    public void pack()
    {
    }
}

All storage implementations will be tested by the following method:

private static void testStorage(final int recordCount, final Storage storage)
{
    final DataGenerator generator = new DataGenerator();
    for ( int i = 0; i < recordCount; ++i )
        storage.addEntry( generator.getNextEntry() );
    storage.pack();
    System.gc();
    final long mem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    System.out.println( storage.getClass().getName() + ": " + mem / 1024.0 / 1024.0 + " MB");
    System.out.println( storage.getById( "idnum10" ) ); //in order to retain storage in memory
}

For every implementation in this article we will try to create 1 million entries. SimpleMapStorage consumes 706 Mb to store 1M records. The actual data size is about 82 Mb, which means that this simple implementation wastes about 85% of the consumed memory. Yes, straightforward solutions for big data storage do not work well in Java…

Array based entries

The initial solution stored field names as keys in the entry map. Suppose all fields in the entry are mandatory (at worst they are equal to empty strings). In this case we may remember the order of fields in the entry and always store field values in that order. We will need one global list which stores the order of fields and one array per record to store field values. This approach saves the memory used to keep references to field names in every record.

/**
 * Suppose all record fields are mandatory. In this case we may save an order of fields in the first record
 * and use it for all other records.
 * Data is stored as a map from key field value to an array of field values. Field value at position N corresponds
 * to field name in m_keys.get( N ).
 */
private static class SimpleArrayStorage extends SimpleStorage
{
    private Map<String, String[]> m_data = new HashMap<String, String[]>( 1000 );
    //order in which we have to access fields, in order not to hardcode it. We assume no optional fields.
    private List<String> m_keys = new ArrayList<String>( 5 );
 
    public void addEntry( final Map<String, String> entry )
    {
        if ( m_keys.isEmpty() )
            m_keys.addAll( entry.keySet() );
        final String[] record = new String[ entry.size() ];
        int i = 0;
        for ( final String key : m_keys )
            record[ i++ ] = entry.get( key );
        m_data.put( entry.get( ID ), record );
    }
 
    public Map<String, String> getById( final String id )
    {
        final String[] record = m_data.get( id );
        if ( record == null )
            return null;
        final Map<String, String> res = new HashMap<String, String>( record.length );
        for ( int i = 0; i < record.length; ++i )
            res.put( m_keys.get( i ), record[ i ] );
        return res;
    }
}

This implementation consumes 495 Mb, which is a good improvement compared to the original 706 Mb, and, I think, a lot of people will be happy to stop at this point 🙂

The 210 Mb saved here come from the memory-inefficient storage inside java.util.HashMap – it keeps entries as a Map.Entry[] table. Each Map.Entry object consumes 28 bytes, we have 11 such objects per record (have you noticed capacity=8 in the HashMap constructor in DataGenerator? The default fill factor for HashMap is 0.75, that's why we get 8 / 0.75 ≈ 11 entries per map) and 1M records. The total memory consumption of Map.Entry objects is about 300 Mb (we don't count the fields themselves here).

By contrast, a String reference in the array occupies 8 bytes, and we have exactly 5 such references per record.
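The back-of-the-envelope arithmetic above can be sketched in a few lines (the sizes are the article's estimates for this JVM, not measured values; the class and method names are illustrative):

```java
//Rough estimate of per-map vs per-array overhead for storing records,
//using the article's figures (28 bytes per Map.Entry, 8 bytes per reference).
class OverheadEstimate
{
    public static long mapEntryOverheadMb( final long records )
    {
        final long entrySize = 28;      //estimated size of one Map.Entry, bytes
        final long entriesPerMap = 11;  //table slots per record map (8 / 0.75, per the article)
        return entrySize * entriesPerMap * records / ( 1024 * 1024 );
    }

    public static long arrayRefOverheadMb( final long records )
    {
        final long refSize = 8;         //one String reference, bytes
        final long fieldsPerRecord = 5;
        return refSize * fieldsPerRecord * records / ( 1024 * 1024 );
    }
}
```

For 1M records this comes to roughly 293 Mb of Map.Entry overhead against roughly 38 Mb of plain array references, which matches the measured gap between the two implementations.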

Array based entry without ID field

We can make a small improvement to the previous implementation: the ID field value is stored both in the index map and in each record array. We can skip it in the array, thus saving one reference per record. Unfortunately, it saves us only the expected 8 Mb: the memory consumption of this implementation is 487 Mb.

/**
 * Same as SimpleArrayStorage, but we now don't save ID field value in an array, because it is available
 * as a map key.
 */
private static class SimpleArrayStorageNoId extends SimpleStorage
{
    private Map<String, String[]> m_data = new HashMap<String, String[]>( 1000 );
    //order in which we have to access fields, in order not to hardcode it. We assume no optional fields.
    private List<String> m_keys = new ArrayList<String>( 5 );
 
    public void addEntry( final Map<String, String> entry )
    {
        if ( m_keys.isEmpty() )
        {
            for ( final String key : entry.keySet() )
                if ( !key.equals( ID ) )
                    m_keys.add( key );
        }
        final String[] record = new String[ entry.size() - 1 ];
        int i = 0;
        for ( final String key : m_keys )
            record[ i++ ] = entry.get( key );
        m_data.put( entry.get( ID ), record );
    }
 
    public Map<String, String> getById( final String id )
    {
        final String[] record = m_data.get( id );
        if ( record == null )
            return null;
        final Map<String, String> res = new HashMap<String, String>( record.length + 1 );
        for ( int i = 0; i < record.length; ++i )
            res.put( m_keys.get( i ), record[ i ] );
        res.put( ID, id );
        return res;
    }
}

Packing field values into UTF-8 encoded byte[]

Our next step is to “compress” our field values and convert them from the inefficient String representation into something more compact. You can find various ideas in part 1 and part 2 of the string packing articles. For simplicity, we will use the most general of them (and often the most efficient) – converting String objects into UTF-8 encoded byte[]. This transformation is lossless and compact for a large number of languages.

/**
 * Same as SimpleArrayStorageNoId, but we store all field values as byte[] in UTF-8 encoding
 */
private static class SimpleArrayStorageNoIdAsUtf8 extends SimpleStorage
{
    private Map<String, byte[][]> m_data = new HashMap<String, byte[][]>( 1000 );
    //order in which we have to access fields, in order not to hardcode it. We assume no optional fields.
    private List<String> m_keys = new ArrayList<String>( 5 );
 
    public void addEntry( final Map<String, String> entry )
    {
        if ( m_keys.isEmpty() )
        {
            for ( final String key : entry.keySet() )
                if ( !key.equals( ID ) )
                    m_keys.add( key );
        }
        final byte[][] record = new byte[ entry.size() - 1 ][];
        int i = 0;
        for ( final String key : m_keys )
            record[ i++ ] = entry.get( key ).getBytes( UTF8 );
        m_data.put( entry.get( ID ), record );
    }
 
    public Map<String, String> getById( final String id )
    {
        final byte[][] record = m_data.get( id );
        if ( record == null )
            return null;
        final Map<String, String> res = new HashMap<String, String>( record.length + 1 );
        for ( int i = 0; i < record.length; ++i )
            res.put( m_keys.get( i ), new String( record[ i ], UTF8 ) );
        res.put( ID, id );
        return res;
    }
}

The new implementation consumes only 292 Mb, which is about 200 Mb less than the previous one. From this point on we will have to make more aggressive changes to our storage classes if we want to approach 82 Mb, the binary size of our data.

Storing all records in a ByteBuffer with a separate index

The following idea is easy to implement for read-only data: once a record has been added to the collection, it can only be queried. If you need to update your data, you will need more complicated memory management.

We will store all records in a single ByteBuffer. Each record's position in the ByteBuffer will be stored in a separate index Map<ID, position>. We will assume that no field exceeds 127 bytes in length (a reasonable assumption for our example data). This allows us to write a record as a series of fields, each UTF-8 encoded and prepended with a 1-byte length:

length_1 value_1 length_2 value_2 length_3 value_3

The start position of each record is stored in the index. Using an explicit length field, rather than a string terminator character, allows any byte value to appear in the encoded data, so we are free to change the field encoding later.

/**
 * Store data in the binary buffer. We assume that buffer will be populated first and then be queried
 * in the read-only mode.
 * We will keep map from key field to each record offset in the buffer in {@code m_index}.
 * The buffer itself {@code m_data} consists of field values prepended by a single byte length field
 * (Pascal convention). For this example we assume no field could be longer than 127 bytes.
 * Each field is converted into UTF-8 encoding before serialization. It allows us to use 1 byte per char
 * for most of european languages and not to lose any data during conversions.
 */
private static class SimpleBinaryStorage extends SimpleStorage
{
    private static final int BUFFER_SIZE_STEP = 10 * 1024 * 1024;
    //map from key to buffer position
    protected Map<String, Integer> m_index = new HashMap<String, Integer>( 1000 );
 
    protected ByteBuffer m_data = ByteBuffer.allocate( BUFFER_SIZE_STEP );
    //order in which we have to access fields, in order not to hardcode it. We assume no optional fields.
    protected List<String> m_keys = new ArrayList<String>( 5 );
 
    public void addEntry( final Map<String, String> entry )
    {
        if ( m_keys.isEmpty() )
            m_keys.addAll( entry.keySet() );
        final List<byte[]> fields = new ArrayList<byte[]>( entry.size() );
        int totalLen = 0;
        //generate byte representations of all fields
        for ( final String key : m_keys )
        {
            final byte[] bytes = entry.get( key ).getBytes( UTF8 );
            totalLen += bytes.length + 1;
            fields.add( bytes );
        }
        //increase buffer size if necessary
        if ( m_data.remaining() < totalLen )
        {
            final ByteBuffer newBuf = ByteBuffer.allocate( m_data.capacity() + BUFFER_SIZE_STEP );
            m_data.flip();
            newBuf.put( m_data );
            m_data = newBuf;
        }
        //save offset of the first byte in a record
        m_index.put( entry.get( ID ), m_data.position() );
        //all fields are byte length followed by that number of bytes (field value)
        for ( final byte[] field : fields )
        {
            m_data.put( ( byte ) field.length ); // assume fields are no longer than 127 bytes
            m_data.put( field );
        }
    }
 
    public Map<String, String> getById( final String id )
    {
        final Integer pos = m_index.get( id );
        if ( pos == null )
            return null;
        final Map<String, String> res = new HashMap<String, String>( m_keys.size() );
        m_data.position( pos );
        for ( final String key : m_keys )
        {
            final int len = m_data.get() & 0xFF; //read the length byte as an unsigned value
            //reading straight from ByteBuffer array in order not to allocate a temp copy
            final String val = new String( m_data.array(), m_data.position(), len, UTF8 );
            m_data.position( m_data.position() + len );
            res.put( key, val );
        }
        return res;
    }
}

This implementation consumes 223 Mb (compared to previous 292 Mb). We have gained this improvement by replacing 5 million separate byte[] objects with a single ByteBuffer.

Using hash codes instead of index keys

The next idea was briefly mentioned in the hashCode method performance tuning article: we can often replace string map keys with their hash codes. In theory, this creates the possibility that another key with the same hash code is used in a query. In practice, that possibility can be nearly eliminated by combining 2 independent 32-bit hash functions into a single long value: the easiest choices are String.hashCode, java.util.zip.CRC32 and java.util.zip.Adler32.

For the purposes of this article, using String.hashCode alone is more than sufficient – all 1 million IDs in our example have distinct hash codes.
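As a sketch of that composite-hash idea (this helper is illustrative and not part of the article's code), two independent 32-bit hashes can be packed into one long value:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

//Illustrative helper: combine String.hashCode with CRC32 into a single
//64-bit key, so a false positive requires both 32-bit hashes to collide.
class LongHash
{
    public static long hash( final String key )
    {
        final CRC32 crc = new CRC32();
        crc.update( key.getBytes( StandardCharsets.UTF_8 ) );
        //high 32 bits: String.hashCode; low 32 bits: CRC32 checksum
        return ( ( long ) key.hashCode() << 32 ) | crc.getValue();
    }
}
```

If the index is keyed by this long value, a query key that collides with a stored key on String.hashCode alone will still almost certainly differ in the CRC32 half.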

/**
 * Same as SimpleBinaryStorage, but now we include compression when all data is ready:
 * all index keys with unique hash codes are replaced with their hash codes (in a different map).
 */
private static class BinaryStorageWithHashing extends SimpleBinaryStorage
{
    private Map<Integer, Integer> m_hashIndex = new HashMap<Integer, Integer>( 1000 );
 
    public void pack()
    {
        //find all hash codes in the key set, calculate their frequency
        final Map<Integer, Integer> freq = new HashMap<Integer, Integer>( m_index.size() );
        for ( final String key : m_index.keySet() )
        {
            final int hashCode = key.hashCode();
            final Integer fr = freq.get( hashCode );
            freq.put( hashCode, fr == null ? 1 : fr + 1 );
        }
        //now keep records with freq > 1
        final Set<Integer> duplicates = new HashSet<Integer>( 100 );
        for ( final Map.Entry<Integer, Integer> entry : freq.entrySet() )
            if ( entry.getValue() > 1 )
                duplicates.add( entry.getKey() );
 
        //now generate 2 actual maps
        final Map<String, Integer> newMap = new HashMap<String, Integer>( 100 );
        for ( final Map.Entry<String, Integer> entry : m_index.entrySet() )
        {
            final int hash = entry.getKey().hashCode();
            if ( duplicates.contains( hash ) )
                newMap.put( entry.getKey(), entry.getValue() );
            else
                m_hashIndex.put( hash, entry.getValue() );
        }
        m_index = newMap;
        System.out.println( "Hash map size = " + m_hashIndex.size() );
        System.out.println( "String map size = " + m_index.size() );
    }
 
    public Map<String, String> getById( final String id )
    {
        Integer pos = m_hashIndex.get( id.hashCode() );
        if ( pos == null )
        {
            pos = m_index.get( id );
            if ( pos == null )
                return null;
        }
        final Map<String, String> res = new HashMap<String, String>( m_keys.size() );
        m_data.position( pos );
        for ( final String key : m_keys )
        {
            final int len = m_data.get() & 0xFF; //read the length byte as an unsigned value
            //reading straight from ByteBuffer array in order not to allocate a temp copy
            final String val = new String( m_data.array(), m_data.position(), len, UTF8 );
            m_data.position( m_data.position() + len );
            res.put( key, val );
        }
        return res;
    }
}

Hashing string map keys lets us get rid of most of those keys in read-only storages, at the price of slightly increased code complexity. The disadvantage is false positive requests, when a query key has the same hash code as one stored in your index; in many situations this problem can be avoided.

In our example you may still compare the ID field value stored in the ByteBuffer with the requested ID and return null if they differ (this is not implemented in the example code).
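A minimal sketch of that check (the class and names here are illustrative, not part of the article's code): once the record has been decoded into a map, the lookup can re-verify the key before returning it.

```java
import java.util.Map;

//Illustrative guard against hash-collision false positives: return the decoded
//record only if the ID field it carries matches the requested ID.
class HashGuardDemo
{
    static final String ID = "id"; //hypothetical key-field name for this demo

    public static Map<String, String> verify( final String requestedId,
                                              final Map<String, String> decoded )
    {
        if ( decoded == null || !requestedId.equals( decoded.get( ID ) ) )
            return null; //collision: the stored record belongs to another key
        return decoded;
    }
}
```

In BinaryStorageWithHashing the ID field is still serialized with the other fields, so getById could apply exactly this comparison to the map it builds.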

This implementation consumes 188 Mb, which is another 15% less than the previous implementation (223 Mb). That's not much by itself, but this transformation enables another couple of more powerful improvements.

Use Trove maps, get rid of ID field in the data storage

Finally, we come to the last couple of improvements we can make to our storage class. First of all, we have a large Map<Integer, Integer> – it should be replaced with a Trove TIntIntHashMap. If you haven't read about Trove collections yet, this example may persuade you to do so.

A second optimization, which may go unnoticed, is to remove the ID field value from the data storage, as we did in SimpleArrayStorageNoId. Last time it saved us only a tiny 8 Mb (8 bytes per record), because we still kept the same value as a key in the index map. Now that we have got rid of most (in our example – all) string keys, we can remove it from the ByteBuffer too.

/**
 * BinaryStorageWithHashing was rewritten in order to use Trove. Besides, the ID field value is no longer
 * stored in the data buffer. This is safe if we assume that we will query only
 * existing keys. Otherwise, there is a small possibility of another key having the same hashCode.
 * This possibility could be reduced by using a composite hash: hashCode plus Adler32/CRC32.
 */
private static class BinaryHashingTroveStorageNoId implements Storage
{
    private static final int BUFFER_SIZE_STEP = 10 * 1024 * 1024;
    //map from key to buffer position
    private TObjectIntMap<String> m_index = new TObjectIntHashMap<String>( 1000 );
    private TIntIntMap m_hashIndex = new TIntIntHashMap( 1000, 0.75f, -1, -1 ); //0 is valid offset
 
    private ByteBuffer m_data = ByteBuffer.allocate( BUFFER_SIZE_STEP );
    //order in which we have to access fields, in order not to hardcode it. We assume no optional fields.
    private List<String> m_keys = new ArrayList<String>( 5 );
 
    public void addEntry( final Map<String, String> entry )
    {
        if ( m_keys.isEmpty() )
        {
            m_keys.addAll( entry.keySet() );
            m_keys.remove( ID );
        }
 
        final List<byte[]> fields = new ArrayList<byte[]>( entry.size() - 1 );
        int totalLen = 0;
        //generate byte representations of all fields
        for ( final String key : m_keys )
        {
            final byte[] bytes = entry.get( key ).getBytes( UTF8 );
            totalLen += bytes.length + 1;
            fields.add( bytes );
        }
        //increase buffer size if necessary
        if ( m_data.remaining() < totalLen )
        {
            final ByteBuffer newBuf = ByteBuffer.allocate( m_data.capacity() + BUFFER_SIZE_STEP );
            m_data.flip();
            newBuf.put( m_data );
            m_data = newBuf;
        }
        //save the record offset in the string index
        m_index.put( entry.get( ID ), m_data.position() );
        //write all fields: a 1 byte length followed by the field bytes
        for ( final byte[] field : fields )
        {
            m_data.put( (byte) field.length );
            m_data.put( field );
        }
    }

    //replace string keys with their hash codes where a hash is unique
    public void pack()
    {
        //count the frequency of every key hash code
        final TIntIntMap freq = new TIntIntHashMap( m_index.size() );
        m_index.forEachKey( new TObjectProcedure<String>() {
            @Override
            public boolean execute( final String key )
            {
                final int hashCode = key.hashCode();
                freq.adjustOrPutValue( hashCode, 1, 1 );
                return true;
            }
        });
        //now keep records with freq > 1
        final TIntSet duplicates = new TIntHashSet( 100 );
        freq.forEachEntry( new TIntIntProcedure() {
            @Override
            public boolean execute( final int key, final int value )
            {
                if ( value > 1 )
                    duplicates.add( key );
                return true;
            }
        });
 
        //now generate 2 actual maps
        final TObjectIntMap<String> newMap = new TObjectIntHashMap<String>( 100 );
        m_index.forEachEntry( new TObjectIntProcedure<String>() {
            @Override
            public boolean execute( final String key, final int value )
            {
                final int hash = key.hashCode();
                if ( duplicates.contains( hash ) )
                    newMap.put( key, value );
                else
                    m_hashIndex.put( hash, value );
                return true;
            }
        });
 
        m_index = newMap;
        System.out.println( "Hash map size = " + m_hashIndex.size() );
        System.out.println( "String map size = " + m_index.size() );
        System.out.println( "Binary data size = " + m_data.position() );
    }
 
    public Map<String, String> getById( final String id )
    {
        int pos = m_hashIndex.get( id.hashCode() );
        if ( pos == -1 )
        {
            Integer pos2 = m_index.get( id );
            if ( pos2 == null )
                return null;
            pos = pos2;
        }
        final Map<String, String> res = new HashMap<String, String>( m_keys.size() );
        m_data.position( pos );
        for ( final String key : m_keys )
        {
            final int len = m_data.get() & 0xFF; //read the length as an unsigned byte
            //reading straight from ByteBuffer array in order not to allocate a temp copy
            final String val = new String( m_data.array(), m_data.position(), len, UTF8 );
            m_data.position( m_data.position() + len );
            res.put( key, val );
        }
        res.put( ID, id );
        return res;
    }
}

This implementation requires 92 Mb of memory. Compare it with the 188 Mb required by the previous implementation, or with the 705 Mb required by the original map of maps. It makes a really great difference in the amount of data you will be able to process.
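
The pack() step used above can be re-created with plain JDK maps to show the idea in isolation (a sketch only – the real code uses Trove precisely to avoid the boxing these JDK maps introduce):

```java
import java.util.HashMap;
import java.util.Map;

//Sketch of the pack() step with JDK maps standing in for Trove: split the
//string index into (a) an int -> offset map for keys whose hash code is unique
//and (b) a residual string map for keys with colliding hash codes.
final class IndexPacker {
    final Map<Integer, Integer> hashIndex = new HashMap<>();
    final Map<String, Integer> stringIndex = new HashMap<>();

    IndexPacker( final Map<String, Integer> index ) {
        //count the frequency of every key hash code
        final Map<Integer, Integer> freq = new HashMap<>();
        for ( final String key : index.keySet() )
            freq.merge( key.hashCode(), 1, Integer::sum );
        //unique hashes go to the compact int index, collisions keep their strings
        for ( final Map.Entry<String, Integer> e : index.entrySet() ) {
            final int hash = e.getKey().hashCode();
            if ( freq.get( hash ) > 1 )
                stringIndex.put( e.getKey(), e.getValue() );
            else
                hashIndex.put( hash, e.getValue() );
        }
    }
}
```

With realistic keys the residual string map ends up tiny (empty in the article's test data), which is where the memory win comes from.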

sun.misc.Unsafe instead of ByteBuffer

Just for demonstration purposes I'll provide a version of the last storage class which uses sun.misc.Unsafe to write data into memory instead of a ByteBuffer.
For this task it is better to use the official and documented ByteBuffer, because Unsafe does not provide any extra functionality in this example.

I will use a slightly updated UnsafeMemory class, which I have already used in the Various methods of binary serialization in Java article, while comparing the performance of various types of ByteBuffers with sun.misc.Unsafe. The UnsafeMemory serialization format is slightly different from the one used in this article – it uses 4 bytes to store a byte[] length. Due to this, memory consumption of the following example will be slightly higher.
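
A tiny sketch of where those extra bytes come from (a hypothetical helper, not part of UnsafeMemory): each field pays a 4 byte length prefix instead of a 1 byte one, so every record grows by 3 bytes per field.

```java
//Hypothetical helper comparing the two length-prefixed layouts discussed above:
//the ByteBuffer storage spends 1 byte per field length, UnsafeMemory spends 4.
final class FieldLayout {
    //total size of the fields with a 1 byte length prefix each
    static int oneBytePrefixed( final byte[][] fields ) {
        int total = 0;
        for ( final byte[] f : fields ) total += 1 + f.length;
        return total;
    }

    //total size of the fields with a 4 byte length prefix each
    static int fourBytePrefixed( final byte[][] fields ) {
        int total = 0;
        for ( final byte[] f : fields ) total += 4 + f.length;
        return total;
    }
}
```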

/**
 * Same as BinaryHashingTroveStorageNoId, but uses sun.misc.Unsafe for data storage instead of a ByteBuffer.
 * Due to UnsafeMemory implementation details, field length is no longer limited to 127 bytes.
 */
private static class UnsafeHashingTroveStorageNoId implements Storage
{
    private static final int BUFFER_SIZE_STEP = 10 * 1024 * 1024;
    //map from key to buffer position
    private TObjectIntMap<String> m_index = new TObjectIntHashMap<String>( 1000 );
    private TIntIntMap m_hashIndex = new TIntIntHashMap( 1000, 0.75f, -1, -1 ); //0 is valid offset
 
    private byte[] m_data = new byte[ BUFFER_SIZE_STEP ];
    private UnsafeMemory m_mem = new UnsafeMemory( m_data );
 
    //order in which we have to access fields, in order not to hardcode it. We assume no optional fields.
    private List<String> m_keys = new ArrayList<String>( 5 );
 
    public void addEntry( final Map<String, String> entry )
    {
        if ( m_keys.isEmpty() )
        {
            m_keys.addAll( entry.keySet() );
            m_keys.remove( ID );
        }
 
        final List<byte[]> fields = new ArrayList<byte[]>( entry.size() - 1 );
        int totalLen = 0;
        //generate byte representations of all fields
        for ( final String key : m_keys )
        {
            final byte[] bytes = entry.get( key ).getBytes( UTF8 );
            totalLen += bytes.length + 4;
            fields.add( bytes );
        }
        //increase buffer size if necessary
        if ( m_mem.remaining() < totalLen )
        {
            final int pos = m_mem.position();
            final byte[] newData = new byte[ m_data.length + BUFFER_SIZE_STEP ];
            System.arraycopy( m_data, 0, newData, 0, pos );
            m_data = newData;
            m_mem = new UnsafeMemory( m_data );
            m_mem.position( pos );
        }
        //save the record offset in the string index
        m_index.put( entry.get( ID ), m_mem.position() );
        //write all fields: UnsafeMemory stores a 4 byte length before each array
        for ( final byte[] field : fields )
            m_mem.putByteArray( field );
    }

    //replace string keys with their hash codes where a hash is unique
    public void pack()
    {
        //count the frequency of every key hash code
        final TIntIntMap freq = new TIntIntHashMap( m_index.size() );
        m_index.forEachKey( new TObjectProcedure<String>() {
            @Override
            public boolean execute( final String key )
            {
                final int hashCode = key.hashCode();
                freq.adjustOrPutValue( hashCode, 1, 1 );
                return true;
            }
        });
        //now keep records with freq > 1
        final TIntSet duplicates = new TIntHashSet( 100 );
        freq.forEachEntry( new TIntIntProcedure() {
            @Override
            public boolean execute( final int key, final int value )
            {
                if ( value > 1 )
                    duplicates.add( key );
                return true;
            }
        });
 
        //now generate 2 actual maps
        final TObjectIntMap<String> newMap = new TObjectIntHashMap<String>( 100 );
        m_index.forEachEntry( new TObjectIntProcedure<String>() {
            @Override
            public boolean execute( final String key, final int value )
            {
                final int hash = key.hashCode();
                if ( duplicates.contains( hash ) )
                    newMap.put( key, value );
                else
                    m_hashIndex.put( hash, value );
                return true;
            }
        });
 
        m_index = newMap;
        System.out.println( "Hash map size = " + m_hashIndex.size() );
        System.out.println( "String map size = " + m_index.size() );
 
        System.out.println( "Binary data size = " + m_mem.position() );
    }
 
    public Map<String, String> getById( final String id )
    {
        int pos = m_hashIndex.get( id.hashCode() );
        if ( pos == -1 )
        {
            Integer pos2 = m_index.get( id );
            if ( pos2 == null )
                return null;
            pos = pos2;
        }
        final Map<String, String> res = new HashMap<String, String>( m_keys.size() );
        m_mem.position( pos );
        for ( final String key : m_keys )
        {
            final String val = new String( m_mem.getByteArray(), UTF8 );
            res.put( key, val );
        }
        res.put( ID, id );
        return res;
    }
}

You can find UnsafeMemory class in the source code archive for this article.

Summary

  • If you want to keep a large number of read-only records in memory, consider storing them in the most compact form possible in a ByteBuffer (or via sun.misc.Unsafe).
  • Use Trove maps for the indices.
  • Check whether you need to keep the ID field values as map keys (you can still verify the ID field stored in the ByteBuffer), or whether a hash code is a sufficient key.
  • If you know all possible query keys in advance – don't store the IDs at all: a hash code is a sufficient replacement for map keys in many situations.

2 thoughts on “Use case: Optimizing memory footprint of a read only csv file (Trove, Unsafe, ByteBuffer, data compression)”

  1. Shivang

    Nice article! Many thanks 🙂

    Minor issue: in the code snippet under “Storing all records in a ByteBuffer with a separate index”, it seems some code is missing regarding the comment “increase buffer size if necessary”; the next method getById( final String id ) starts mid-statement.

