Tag Archives: hashcode

Use case: Optimizing memory footprint of a read only csv file (Trove, Unsafe, ByteBuffer, data compression)

by Mikhail Vorontsov

Suppose your application has to obtain some data from an auxiliary data source, which could be a csv file. Your csv file will contain several fields, one of those is used as a field ID. You need to keep all that file in memory and provide fast access to records by an ID field. Your additional target is to consume as little memory as possible, while keeping access cost as low as possible. In this article we will process fake person records. Here is an example:

{id=idnum10, surname=Smith10, address=10 One Way Road, Springfield, NJ, DOB=01/11/1965, names=John Paul 10 Ringo}
    

All these records are generated by the following class:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
private static class DataGenerator
{
    private int cnt = 1;
 
    public Map<String, String> getNextEntry()
    {
        final Map<String, String> res = new HashMap<String, String>( 8 );
        res.put( ID, "idnum" + cnt );
        res.put( "names", "John Paul " + cnt + " Ringo" );
        res.put( "surname", "Smith" + cnt );
        res.put( "DOB", "01/" + two((cnt % 31) + 1) + "/1965" ); //date of birth, MM/DD/YYYY
        res.put( "address", cnt + " One Way Road, Springfield, NJ" );
        ++cnt;
        return res;
    }
 
    private String two( final int val )
    {
        return val < 10 ? "0" + val : Integer.toString( val );
    }
}
private static class DataGenerator
{
    private int cnt = 1;

    public Map<String, String> getNextEntry()
    {
        final Map<String, String> res = new HashMap<String, String>( 8 );
        res.put( ID, "idnum" + cnt );
        res.put( "names", "John Paul " + cnt + " Ringo" );
        res.put( "surname", "Smith" + cnt );
        res.put( "DOB", "01/" + two((cnt % 31) + 1) + "/1965" ); //date of birth, MM/DD/YYYY
        res.put( "address", cnt + " One Way Road, Springfield, NJ" );
        ++cnt;
        return res;
    }

    private String two( final int val )
    {
        return val < 10 ? "0" + val : Integer.toString( val );
    }
}

Simple approach – map of maps

We always have to start from something simple and easy to support. In this case it may be a map of maps: outer map is indexed by ID field and th inner ones are indexed by field names.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
    /**
     * Initial version. Outer map indexed by a key field value, inner map is field name to field value.
     * All field names are interned, but we still have to pay for storing references to field names.
     */
    private static class SimpleMapStorage extends SimpleStorage
    {
        private final Map<String, Map<String, String>> m_data = new HashMap<String, Map<String, String>>( 1000 );
 
        public void addEntry( final Map<String, String> entry )
        {
            m_data.put( entry.get( ID ), entry );
        }
 
        public Map<String, String> getById( final String id )
        {
            return m_data.get( id );
        }
    }
    
    /**
     * Initial version. Outer map indexed by a key field value, inner map is field name to field value.
     * All field names are interned, but we still have to pay for storing references to field names.
     */
    private static class SimpleMapStorage extends SimpleStorage
    {
        private final Map<String, Map<String, String>> m_data = new HashMap<String, Map<String, String>>( 1000 );

        public void addEntry( final Map<String, String> entry )
        {
            m_data.put( entry.get( ID ), entry );
        }

        public Map<String, String> getById( final String id )
        {
            return m_data.get( id );
        }
    }
    

For testing purposes, all storage implementations will either implement Storage interface or extend SimpleStorage class. You will see the purpose of pack method in the more advanced examples.

1
2
3
4
5
6
7
8
9
10
11
12
13
private interface Storage
{
    public void addEntry( final Map<String, String> entry );
    public Map<String, String> getById( final String id );
    public void pack();
}
 
public static abstract class SimpleStorage implements Storage
{
    public void pack()
    {
    }
}
private interface Storage
{
    public void addEntry( final Map<String, String> entry );
    public Map<String, String> getById( final String id );
    public void pack();
}

public static abstract class SimpleStorage implements Storage
{
    public void pack()
    {
    }
}

All storage implementations will be tested by the following method:

1
2
3
4
5
6
7
8
9
10
11
private static void testStorage(final int recordCount, final Storage storage)
{
    final DataGenerator generator = new DataGenerator();
    for ( int i = 0; i < recordCount; ++i )
        storage.addEntry( generator.getNextEntry() );
    storage.pack();
    System.gc();
    final long mem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    System.out.println( storage.getClass().getName() + ": " + mem / 1024.0 / 1024.0 + " MB");
    System.out.println( storage.getById( "idnum10" ) ); //in order to retain storage in memory
}
private static void testStorage(final int recordCount, final Storage storage)
{
    final DataGenerator generator = new DataGenerator();
    for ( int i = 0; i < recordCount; ++i )
        storage.addEntry( generator.getNextEntry() );
    storage.pack();
    System.gc();
    final long mem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
    System.out.println( storage.getClass().getName() + ": " + mem / 1024.0 / 1024.0 + " MB");
    System.out.println( storage.getById( "idnum10" ) ); //in order to retain storage in memory
}

For every implementation in this article we will try to create 1 million entries. SimpleMapStorage consumes 706 Mb to store 1M records. The actual data size is about 82 Mb, which means that this simple implementation wastes about 85% of consumed memory. Yes, straightforward solutions for big data storage do not work well in Java…

Continue reading

Changes to String internal representation made in Java 1.7.0_06

by Mikhail Vorontsov

This post was updated on 19 Nov 2013 in order to reflect Java 8 changes.


This post was updated on 28 Nov 2013 in order to reflect Java 7u40 changes (thanks to Sunny Chan and his colleagues for pointing my attention at this JDK update).

Sharing an underlying char[]

An original String implementation has 4 non static field: char[] value with string characters, int offset and int count with an index of the first character to use from value and a number of characters to use and int hash with a cached value of a String hash code. As you can see, in a very large number of cases a String will have offset = 0 and count = value.length. The only exception to this rule were the strings created via String.substring calls and all API calls using this method internally (like Pattern.split).

String.substring created a String, which shared an internal char[] value with an original String, which allowed you:

  1. To save some memory by sharing character data
  2. To run String.substring in a constant time ( O(1) )

At the same time such feature was a source of a possible memory leak: if you extract a tiny substring from an original huge string and discard that original string, you will still have a live reference to the underlying huge char[] value taken from an original String. The only way to avoid it was to call a new String( String ) constructor on such string – it made a copy of a required section of underlying char[], thus unlinking your shorter string from its longer “parent”.

From Java 1.7.0_06 (as well as in current versions of Java 8 – Nov 13) offset and count fields were removed from a String. This means that you can’t share a part of an underlying char[] value anymore. Now you can forget about a memory leak described above and never ever use new String(String) constructor anymore. As a drawback, you now have to remember that String.substring has now a linear complexity instead of a constant one.

Continue reading