Use case: how to compact a long-to-long mapping

by Mikhail Vorontsov

In this post we will discuss which techniques can be applied to an integer-to-integer map in order to further reduce its memory consumption. We will use Trove collections throughout.

Suppose we have two similar message streams. Each stream contains “main” messages, which define new identifiers, and “child” messages, which refer to earlier defined identifiers. Identifiers in both streams are positive long values starting from 1, but there may be gaps in either identifier sequence.

We need to merge two such streams into one, keeping the references between “main” and “child” messages intact. We are allowed to change any identifiers as long as the references are not broken.

Here is an example of how it should work. The first id you see is 1 from stream A – map it to 1. Next comes 1 from stream B – map it to 2. The next one is 4 from stream B (two ids are missing in stream B) – map it to 3. It is followed by 2 from stream A – map it to 4. Now you get 1 from stream A again – reuse the previously assigned 1. It is followed by 2 from stream A – reuse the previously assigned 4. And so on…

The original implementation used two Map&lt;Long, Long&gt; maps to store the mappings for stream A and stream B.

import java.util.HashMap;
import java.util.Map;

public class LongToLongMap implements ILongToLongMap
{
    private long m_id = 1;
    private final Map<Long, Long> m_first = new HashMap<Long, Long>( 1000 );
    private final Map<Long, Long> m_second = new HashMap<Long, Long>( 1000 );
 
    public long map( final long srcId, final boolean firstStream )
    {
        final Map<Long, Long> map = firstStream ? m_first : m_second;
        Long val = map.get( srcId );
        if ( val == null )
        {
            val = m_id++;
            map.put( srcId, val );
        }
        return val;
    }
}
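As a quick sanity check (my own snippet, not from the original post), replaying the example above against this implementation prints the expected values:

final ILongToLongMap map = new LongToLongMap();
System.out.println( map.map( 1, true ) );  //1 from stream A -> 1
System.out.println( map.map( 1, false ) ); //1 from stream B -> 2
System.out.println( map.map( 4, false ) ); //4 from stream B -> 3
System.out.println( map.map( 2, true ) );  //2 from stream A -> 4
System.out.println( map.map( 1, true ) );  //1 from stream A again -> 1
System.out.println( map.map( 2, true ) );  //2 from stream A again -> 4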

Of course, the first thing you should do is to replace them with TLongLongHashMap instances.

import gnu.trove.map.TLongLongMap;
import gnu.trove.map.hash.TLongLongHashMap;

public class LongToLongTroveMap implements ILongToLongMap
{
    private long m_id = 1;
    private final TLongLongMap m_first = new TLongLongHashMap( 1000 );
    private final TLongLongMap m_second = new TLongLongHashMap( 1000 );
 
    public long map( final long srcId, final boolean firstStream )
    {
        final TLongLongMap map = firstStream ? m_first : m_second;
        long val = map.get( srcId );
        //zero plays the role of null in JDK maps; this ‘no entry’ value is not fixed and can be changed via a map constructor
        if ( val == 0L ) 
        {
            val = m_id++;
            map.put( srcId, val );
        }
        return val;
    }
}
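A side note on that zero: if 0 could be a legal identifier in your streams, this check would break. Trove lets you pick a different ‘no entry’ value – a minimal sketch, assuming Trove 3’s four-argument TLongLongHashMap constructor (initial capacity, load factor, no-entry key, no-entry value):

//hypothetical variant: -1 becomes the 'no entry' value, so 0 remains a legal id
private final TLongLongMap m_first = new TLongLongHashMap( 1000, 0.5f, -1L, -1L );

//the lookup then compares against -1 instead of 0:
long val = m_first.get( srcId );
if ( val == -1L )
{
    val = m_id++;
    m_first.put( srcId, val );
}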

The first useful property to notice is that all our identifiers start from 1 and grow with rather infrequent gaps. This means that most of them fit into the int type. But we don’t want to change our interface – it should still support long-to-long mapping. To achieve this, we will implement two inner maps inside our class: the first one (the default) is an int-to-int map and the second one is the original long-to-long map.

There are two possible policies to implement. The first one is to start with only an int-to-int map and, on the first key or value that falls outside the int range, copy the entire contents of the int-to-int map into a long-to-long map and keep working with the latter regardless of any subsequent values.
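A minimal sketch of this first policy (my own illustration, not code from the original post) could look as follows – it keeps only an int-to-int map until the first out-of-range key or value arrives, then copies everything into a long-to-long map exactly once:

private static class PromotingMap
{
    private TIntIntMap m_int = new TIntIntHashMap( 1000 );
    private TLongLongMap m_long = null; //created on the first out-of-range entry

    public void put( final long key, final long value )
    {
        if ( m_long == null )
        {
            if ( key <= Integer.MAX_VALUE && value <= Integer.MAX_VALUE )
            {
                m_int.put( ( int ) key, ( int ) value );
                return;
            }
            //promote: copy all int entries into a long-to-long map exactly once
            m_long = new TLongLongHashMap( Math.max( 1000, m_int.size() * 2 ) );
            for ( final int k : m_int.keys() )
                m_long.put( k, m_int.get( k ) );
            m_int = null; //the int-to-int map is not used anymore
        }
        m_long.put( key, value );
    }

    public long get( final long key )
    {
        if ( m_long != null )
            return m_long.get( key );
        //before promotion nothing outside the int range could have been stored
        return key <= Integer.MAX_VALUE ? m_int.get( ( int ) key ) : 0;
    }
}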

The second policy handles a mix of int and long keys and values better: it creates both maps at the beginning and stores each entry in the appropriate one. Let’s implement this policy.

import gnu.trove.map.TIntIntMap;
import gnu.trove.map.TLongLongMap;
import gnu.trove.map.hash.TIntIntHashMap;
import gnu.trove.map.hash.TLongLongHashMap;

public class DualLongToLongMap implements ILongToLongMap
{
    private long m_id = 1;
    private final DualMap m_first = new DualMap();
    private final DualMap m_second = new DualMap();
    
    public long map( final long srcId, final boolean firstStream ) 
    {
        final DualMap map = firstStream ? m_first : m_second;
        long val = map.get( srcId );
        if ( val == 0 )
        {
            val = m_id++;
            map.put( srcId, val );
        }
        return val;
    }
    
    private static class DualMap
    {
        private static final long MAX_INT = Integer.MAX_VALUE; //not to cast each time
        
        private final TIntIntMap m_int = new TIntIntHashMap( 1000 );
        private final TLongLongMap m_long = new TLongLongHashMap( 100 );
 
        public void put( final long key, final long value )
        {
            if ( key <= MAX_INT && value <= MAX_INT )
                m_int.put( ( int ) key, ( int ) value );
            else
                m_long.put( key, value );
        }
        
        public long get( final long key )
        {
            if ( key > MAX_INT ) //for long keys query only long map
                return m_long.get( key );
            //for int keys check both maps
            final int res = m_int.get( ( int ) key );
            return res != 0 ? res : m_long.get( key );
        }
    }
}
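Note that get() has to consult both maps for a key that fits into an int: such a key may still have been stored in the long-to-long map if its value did not fit into an int.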

OK, now instead of storing two longs per entry we store two ints. Can we do better? Yes, of course! The task does not require anything to be removed from the mapping at runtime. This means the keys are mostly consecutive values starting from 1, each mapped to something else (1 to A, 2 to B, 3 to C, 5 to D and so on). So we can actually store all values in an array and use the keys as array indices. That means 4 bytes per entry!

It is worth noticing that if entries may need to be removed from the mapping, then only a map is suitable in the general case, because only a map can compactly represent a sparse mapping. But in case of infrequent removals we may still start with an array and simply count the number of missing records in it (the highest array index minus the number of stored elements). If that number exceeds a predefined threshold, we can copy all data into a map and keep using the map afterwards (see the sketch below).
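As a hedged illustration of that idea (my own sketch, not from the original post; the threshold and names are made up), such a hybrid could look like this:

private static class CompactingMap
{
    private static final int MAX_GAPS = 100000; //illustrative threshold

    private TIntArrayList m_ar = new TIntArrayList( 1000 );
    private TIntIntMap m_map = null; //used after too many gaps have appeared
    private int m_size = 0; //number of live entries in the array

    public void put( final int key, final int value )
    {
        if ( m_map != null ) { m_map.put( key, value ); return; }
        for ( int i = m_ar.size(); i <= key; ++i )
            m_ar.add( 0 );
        if ( m_ar.getQuick( key ) == 0 )
            ++m_size;
        m_ar.setQuick( key, value );
        checkGaps();
    }

    public void remove( final int key )
    {
        if ( m_map != null ) { m_map.remove( key ); return; }
        if ( key < m_ar.size() && m_ar.getQuick( key ) != 0 )
        {
            m_ar.setQuick( key, 0 );
            --m_size;
            checkGaps();
        }
    }

    public int get( final int key )
    {
        if ( m_map != null )
            return m_map.get( key );
        return key < m_ar.size() ? m_ar.getQuick( key ) : 0;
    }

    //gaps = highest used index minus the number of live entries;
    //once there are too many, copy everything into a hash map
    private void checkGaps()
    {
        if ( m_ar.size() - m_size > MAX_GAPS )
        {
            m_map = new TIntIntHashMap( m_size * 2 );
            for ( int i = 0; i < m_ar.size(); ++i )
                if ( m_ar.getQuick( i ) != 0 )
                    m_map.put( i, m_ar.getQuick( i ) );
            m_ar = null;
        }
    }
}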

//in DualMap the TIntIntMap m_int is replaced with an ArrayMap
//(this fragment also needs import gnu.trove.list.array.TIntArrayList):
private final ArrayMap m_int = new ArrayMap();
 
private static class ArrayMap
{
    private final TIntArrayList m_ar = new TIntArrayList( 1000 );
 
    public void put( final int key, final int value )
    {
        //extend array list up to a length required to keep a given key
        for ( int i = m_ar.size(); i <= key; ++i )
            m_ar.add( 0 );
        //here we know that array list has enough space
        m_ar.setQuick( key, value );
    }
 
    public int get( final int key )
    {
        if ( key >= m_ar.size() )
            return 0;
        return m_ar.getQuick( key );
    }
}
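Note that the extension loop makes an individual put() as expensive as the distance between the current end of the list and the new key, but since our identifiers grow almost consecutively, the amortized cost per entry remains constant.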

Let’s check how much memory a mapping of 5M identifiers in each stream (10M identifiers in total) will consume.

private static void testImpl( final ILongToLongMap map, final long highValue )
{
    System.gc();
    final long start = System.currentTimeMillis();
    boolean stream = true;
    for ( long i = 1; i < highValue; ++i )
    {
        map.map( i, stream );
        stream = !stream;
    }
    final long time = System.currentTimeMillis() - start;
    System.out.println( "Impl: " + map.getClass().getName() + "Time to fill = " + time / 1000.0 + " sec" );
    System.gc();
    System.out.println( "Memory occupied = " + ( Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory() ) / 1024.0 / 1024.0 + " Mb" );
    //use the map after the memory measurement so it stays reachable and is not optimized away
    if ( map.map( 100, true ) == 100000 )
    {
        System.out.println( "1" );
    }
    System.gc();
}
          Map<Long, Long>   TLongLongHashMap   TIntIntHashMap and TLongLongHashMap   ArrayMap and TLongLongHashMap
Memory    829.65 Mb         556.4 Mb           294.61 Mb                             125.4 Mb
Time      2.141 sec         0.714 sec          0.706 sec                             0.371 sec

As you can see, the difference in CPU performance is not that big (even the 5 times difference between the best and the worst cases is just a matter of seconds), but the difference in the amount of occupied memory is huge. The best implementation requires 6.5 times less memory than the Map<Long, Long> version, occupying only about 13 bytes per entry (125.4 Mb for 10M entries) compared to about 83 bytes per entry (829.65 Mb for 10M entries). All these numbers, of course, vary with the fill factor and the number of elements in the data structures. Such a difference in memory consumption may mean a few extra years during which your program can keep working on the same hardware.

Summary

Analyze the properties of the data you have to store in the largest data structures of your program. Try to identify cases where most of your data fits into a more compact data type than the original one. Number-to-number maps can be compacted especially effectively when you notice that the keys are nearly consecutive values: in that case the map can be converted into an array.

