Trove library: using primitive collections for performance

by Mikhail Vorontsov

This article describes Trove – a library of high-performance primitive collection implementations for Java. It covers the latest version of Trove at the time of writing, which is 3.0.

Why should you use Trove? Why not keep using the well-known JDK collections? The answer is performance and memory consumption. Trove doesn’t use any java.lang.Number subclasses internally, so you don’t pay for boxing/unboxing every time you pass a primitive value to, or query one from, a collection. Besides (and this is the same for JDK 6 and 7), you don’t waste N + 8 bytes (rounded up to the nearest 8 bytes) of a wrapper object to store an N-byte primitive value (yes, there are exceptions such as the cached values between -128 and 127 for the integer types). Trove also doesn’t create a Map.Entry for each key-value pair, unlike java.util.HashMap, which reduces the map memory footprint. For sets it doesn’t use a map internally – it stores just the set values.
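
To illustrate the boxing difference, here is a minimal sketch contrasting a JDK map with a Trove one (class and package names are taken from Trove 3.0’s gnu.trove layout; only the calls shown are assumed):

import java.util.HashMap;
import java.util.Map;
import gnu.trove.map.hash.TIntIntHashMap;

// JDK map: every put/get boxes the key and the value into Integer objects
final Map<Integer, Integer> jdkMap = new HashMap<Integer, Integer>();
jdkMap.put( 1, 100 );                 // autoboxing on both arguments

// Trove map: primitives go in and come out directly, no wrapper objects
final TIntIntHashMap troveMap = new TIntIntHashMap();
troveMap.put( 1, 100 );               // put( int, int )
final int value = troveMap.get( 1 );  // get( int ) returns int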

There are 3 main collection types in Trove: array lists, sets and maps. There are also queues, stacks and linked lists, but they are less important (and instances of these collections usually tend to be rather small).


Array lists

All array lists are built on an array of the corresponding data type (for example, int[] for TIntArrayList). There is one small thing you have to deal with in Trove array lists – the value for absent elements (the default value). It is zero by default, but you can override it using

public TIntArrayList(int capacity, int no_entry_value);
or
public static TIntArrayList wrap(int[] values, int no_entry_value);
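
For example, a list that treats -1 as its “no entry” value could be created in either of these ways (a short sketch; gnu.trove.list.array.TIntArrayList is assumed to be the Trove 3.0 class):

import gnu.trove.list.array.TIntArrayList;

// capacity of 10, with -1 as the "no entry" (default) value
final TIntArrayList list1 = new TIntArrayList( 10, -1 );

// wrap an existing int[] without copying it, again using -1 as the default value
final TIntArrayList list2 = TIntArrayList.wrap( new int[] { 5, 7, 9 }, -1 );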

There are two useful methods called getQuick/setQuick – they access the underlying array without any additional checks. As a side effect, they allow access to elements between the list size and its capacity (still, don’t rely on this too much – it is better to add values legally and call getQuick only while you stay inside the array list boundaries).
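
A minimal sketch of the difference between the checked and the “quick” accessors (assuming the accessed indices are within the list size):

final TIntArrayList list = new TIntArrayList( new int[] { 10, 20, 30 } );

list.set( 1, 25 );                 // range-checked against the list size
list.setQuick( 2, 35 );            // writes straight into the backing array, no checks

final int a = list.get( 0 );       // checked access
final int b = list.getQuick( 2 );  // unchecked access, cheaper in tight loops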

Each Trove array list has several helper methods which replace java.util.Collections methods:

public void reverse()
public void reverse(int from, int to)
public void shuffle(java.util.Random rand)
public void sort()
public void sort(int fromIndex, int toIndex)
public void fill(int val)
public void fill(int fromIndex, int toIndex, int val)
public int binarySearch(int value)
public int binarySearch(int value, int fromIndex, int toIndex)
public int max()
public int min()
public int sum()

Trove array lists support three types of iteration:

  • By index – use any loop from index = 0 to list.size() – 1. This approach is useful when you need to know the index of the current element.
  • Using an iterator – there are several iterators defined for each primitive type. This one seems the least useful because it doesn’t provide any new functionality compared to the other methods.
  • Using the functional approach – a user-defined method (more precisely, an implementation of one of several primitive-type-based interfaces) is called for each value in the list.

The first two iteration methods are familiar to Java developers, so let’s take a closer look at the last one. There are two iteration methods defined for Trove array lists (examples are for TIntArrayList):

public boolean forEach( TIntProcedure procedure );
public boolean forEachDescending( TIntProcedure procedure );

The first method iterates over all elements from the beginning of the list to the end, the second – from the end to the beginning. The implemented method should usually return true; if it returns false, iteration stops. This is how one could print all elements in the array list:

lst.forEach( new TIntProcedure() {
    @Override
    public boolean execute( final int value ) {
        System.out.println( value );
        return true;
    }
});

There is another pair of methods:

public TIntList grep( TIntProcedure condition );
public TIntList inverseGrep( TIntProcedure condition );

They grep the array list. The first method accumulates all values for which the implemented method returns true, the second does the opposite, accumulating all values for which false was returned.
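
For example, collecting all non-negative values of a list could look like this (a sketch; inverseGrep with the same procedure would return the remaining, negative values):

final TIntList nonNegative = lst.grep( new TIntProcedure() {
    @Override
    public boolean execute( final int value ) {
        return value >= 0; // keep this value
    }
});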

This approach has one disadvantage – if you need to update a primitive variable from these methods, you’ll have to use either one of the Mutable* classes from the Apache Commons Lang package or implement one yourself:

private static final class MutableInt
{
    public int value;
}

You have to use such a class because all local variables accessed from inner classes must be final, so you can’t update a primitive variable directly – instead you update a non-final field of a final holder object.
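
For example, summing all elements from inside forEach could look like this (just a sketch of the pattern; for a plain sum the sum() method above is simpler):

final MutableInt total = new MutableInt();
lst.forEach( new TIntProcedure() {
    @Override
    public boolean execute( final int value ) {
        total.value += value; // the holder reference is final, its field is not
        return true;          // keep iterating
    }
});
System.out.println( total.value );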

for*-methods are also IDE friendly – most IDEs support autocompleting method parameters, including interface implementations. You simply need to type forEach( new and press your IDE’s autocomplete key combination.

Sets

Trove sets use one array to store the set values (using open addressing) and another array of bytes to keep each cell status: free, occupied or deleted. Besides the primitive type sets, there are THashSet&lt;E&gt; and TLinkedHashSet&lt;E&gt;, which can replace java.util.HashSet and java.util.LinkedHashSet from the JDK.

Both JDK sets are based on a HashMap&lt;E, Object&gt;, where the Object values are just presence flags. HashMaps, in turn, are based on an array of Map.Entry objects. As a result, a JDK HashSet uses three objects per stored value. This is especially expensive when you need to store primitive values in a set.

A JDK LinkedHashSet is built on top of HashSet by adding a linked list that keeps the order in which records have been inserted. Let’s see how much memory a set of 10M integers will consume:

JDK HashSet&lt;Integer&gt;: 525M
JDK LinkedHashSet&lt;Integer&gt;: 602M
Trove THashSet&lt;Integer&gt;: 236M
Trove TLinkedHashSet&lt;Integer&gt;: 312M
Trove TIntSet: 103M

There is no difference between Java 6 and 7 in this test. Trove THashSet uses a simple array of Objects to keep the set. TLinkedHashSet adds a TIntList to keep the insertion order of records. So it makes sense to use Trove object sets as a replacement for the JDK sets, especially since they implement the Collection, Set and Iterable interfaces.

Note that for a set of integer values (short/int/long) it is most probably worth replacing such a set with a bit set. Read the “Bit sets” post for more details.

Let’s return to the primitive type sets. The addAll, containsAll, removeAll and retainAll methods are overloaded to accept java.util.Collection, TTypeCollection (where Type is any of the primitive types) and primitive type arrays as arguments, providing compatibility with JDK collections, Trove collections and lower level code.
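
A short sketch of what these overloads allow (gnu.trove.set.hash.TIntHashSet is assumed to be the Trove 3.0 set class):

import java.util.Arrays;
import gnu.trove.set.hash.TIntHashSet;

final TIntHashSet set = new TIntHashSet();
set.addAll( new int[] { 1, 2, 3 } );                  // plain primitive array
set.addAll( new TIntHashSet( new int[] { 3, 4 } ) );  // another Trove collection
set.addAll( Arrays.asList( 5, 6 ) );                  // java.util.Collection<Integer> (boxed)
System.out.println( set.size() );                     // prints 6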

Maps

Trove maps could be divided into 4 groups:

  1. Maps from one primitive type to another.
  2. Maps from objects to primitive types and vice versa.
  3. java.util.HashMap replacement called THashMap.
  4. Maps from objects to primitive types, and from objects to objects, which use a user-defined hashing strategy.

All maps are built on top of three arrays – one for the keys, another for the values and the last one for the cell states (free, occupied, deleted). All of them use open addressing to determine the position of a key in a map. How does it work?

The hash code of the key is calculated and taken modulo the internal array length. That is the initial position of the search. After that, some prime number is added to the position (again modulo the array length) to get the next probe position; any other function could also be used to define the next position. Eventually an empty cell is found and the new element is inserted there. The same logic is used to find a key in the map, the only difference being that the search key has to be compared with the existing key at each step.
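
A simplified sketch of that probing loop (a toy illustration only, not Trove’s actual code; findSlot, FREE and FULL are hypothetical names):

// keys[] holds the keys, states[] marks each cell as FREE or FULL; the table is
// assumed to never be completely full (real implementations also track DELETED
// cells and resize when the load factor is exceeded)
static final byte FREE = 0;
static final byte FULL = 1;

static int findSlot( final int[] keys, final byte[] states, final int key ) {
    final int capacity = keys.length;
    final int step = 7;                           // some fixed prime probe step
    int index = ( key & 0x7FFFFFFF ) % capacity;  // initial probe position
    while ( states[ index ] == FULL && keys[ index ] != key ) {
        index = ( index + step ) % capacity;      // move to the next candidate cell
    }
    return index; // either a FREE cell (insert here) or the cell holding the key
}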

The methods called keys, keySet, valueCollection and values can be used to iterate over map keys and values, but it is recommended to use either an iterator or the functional approach: forEachEntry/forEachKey/forEachValue. Both approaches are roughly equally fast; use an iterator if you need to throw an exception or accumulate some scalar value, otherwise use one of the for* methods.
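
For example, printing all entries of an int-to-int map with forEachEntry might look like this (a sketch; TIntIntMap, TIntIntHashMap and TIntIntProcedure are the corresponding Trove 3.0 classes):

final TIntIntMap intMap = new TIntIntHashMap();
intMap.put( 1, 10 );
intMap.put( 2, 20 );

intMap.forEachEntry( new TIntIntProcedure() {
    @Override
    public boolean execute( final int key, final int value ) {
        System.out.println( key + " -> " + value );
        return true; // returning false would stop the iteration
    }
});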

Trove map iterators are slightly unusual. The main difference from JDK iterators is that two fields are accessible on each step – a key and a value. That’s why the iterator can’t implicitly advance to the next entry the way a next method would. Instead, there is an advance method whose sole purpose is to move to the next map entry. The key and value are accessible via the key and value methods. The current entry can be removed from the map with the remove method, and its value can be changed with the setValue method.

The iterator JavaDoc provides good examples:

// accessing keys/values through an iterator:
for ( TDoubleIntIterator it = map.iterator(); it.hasNext(); ) {
    it.advance();
    if ( satisfiesCondition( it.key() ) ) {
        doSomethingWithValue( it.value() );
    }
}

// modifying values in-place through iteration:
for ( TDoubleIntIterator it = map.iterator(); it.hasNext(); ) {
    it.advance();
    if ( satisfiesCondition( it.key() ) ) {
        it.setValue( newValueForKey( it.key() ) );
    }
}

// deleting entries during iteration:
for ( TDoubleIntIterator it = map.iterator(); it.hasNext(); ) {
    it.advance();
    if ( satisfiesCondition( it.key() ) ) {
        it.remove();
    }
}

// faster iteration by avoiding hasNext():
TDoubleIntIterator iterator = map.iterator();
for ( int i = map.size(); i-- > 0; ) {
    iterator.advance();
    doSomethingWithKeyAndValue( iterator.key(), iterator.value() );
}

New map methods

Trove collections are not plain vanilla collections – there is always something extra. All examples below are given for TDoubleIntMap. First of all, there is

int putIfAbsent( double v, int i )

which proved to be useful in ConcurrentHashMap. It allows the user to get rid of the containsKey call in

if ( !map.containsKey( key ) )
    map.put( key, value );

making only one map lookup instead of two.
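
With putIfAbsent the same check-then-put collapses into a single call (a sketch for a TDoubleIntMap):

// inserts the mapping only if the key is absent; returns the previous value
// (or the map's no-entry value if there was none)
map.putIfAbsent( key, value );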

The next useful method is

public int adjustOrPutValue( double key, int adjust_amount, int put_amount )

If you are building a frequency map (counting the number of occurrences of each value in some collection), you usually write something like:

final double[] array = new double[ 100 ];
// ... array is filled here ...
final TDoubleIntMap map = new TDoubleIntHashMap( 100 );
for ( final double key : array )
{
    final int cnt = map.get( key );
    map.put( key, cnt + 1 );
}

With Trove maps this could be simplified to:

for ( final double key : array )
{
    map.adjustOrPutValue( key, 1, 1 );
}

There are two more similar methods:

public boolean increment( double key )
public boolean adjustValue( double key, int amount )

They update the value for the given key, but they don’t insert it initially. So, if a key is present in the map, its value is updated and both methods return true; otherwise nothing happens and false is returned. That’s why they are not the best candidates for the previous example.
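
A small sketch of their behavior (assuming a TDoubleIntHashMap as above):

final TDoubleIntMap counters = new TDoubleIntHashMap();
counters.put( 1.0, 5 );

counters.increment( 1.0 );        // returns true, value becomes 6
counters.adjustValue( 1.0, 10 );  // returns true, value becomes 16
counters.adjustValue( 2.0, 10 );  // returns false, key 2.0 is still absent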

The last, but not least, useful method is

boolean retainEntries( TDoubleIntProcedure tDoubleIntProcedure )

If you need to clean some entries out of a map, this is the right method for the job. Just provide a predicate which returns true for all entries to be retained and false for all other entries. The only small disadvantage of this method is that it doesn’t compact the map after cleaning, so if you expect to remove a lot of entries, call

public void compact()

to shrink and rehash a map.
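
For example, dropping all entries with non-positive values and then compacting the map could look like this (a sketch using TDoubleIntProcedure):

map.retainEntries( new TDoubleIntProcedure() {
    @Override
    public boolean execute( final double key, final int value ) {
        return value > 0; // keep only entries with positive values
    }
});
map.compact(); // shrink and rehash after removing many entries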

Hashing strategies (foreign classes, arrays)

Usually maps rely on their keys overriding the equals and hashCode methods. This is fine as long as you can modify the key class, but in some cases you can’t: either you aren’t allowed to modify the class, or you want to use different logic in these two methods. For example, if there is a set of strings representing some sort of ids and all these strings start with a common prefix (a state code, for example), we may want to skip this prefix in order to improve the hashCode quality.
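
A possible prefix-skipping strategy for that id example might look like this (a sketch; the 4-character prefix length and the id format are made-up assumptions, and equals still compares the full strings):

final TObjectIntCustomHashMap<String> idMap = new TObjectIntCustomHashMap<String>( new HashingStrategy<String>() {
    @Override
    public int computeHashCode( final String id ) {
        return id.substring( 4 ).hashCode(); // skip the common prefix, e.g. "NSW-"
    }

    @Override
    public boolean equals( final String id1, final String id2 ) {
        return id1.equals( id2 ); // full comparison is still correct and cheap
    }
});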

Another case is arrays. By default they use the equals and hashCode implementations inherited from Object. java.util.Arrays provides implementations of these two methods for all array types (all primitive types and Object). So, in order to use these helper methods, we need a custom hashing strategy:

final TObjectIntCustomHashMap<String[]> map = new TObjectIntCustomHashMap<String[]>( new HashingStrategy<String[]>() {
    @Override
    public int computeHashCode( String[] array ) {
        return Arrays.hashCode( array );
    }
 
    @Override
    public boolean equals( String[] ar1, String[] ar2 ) {
        return Arrays.equals( ar1, ar2 );
    }
});

Now such a map prints 100 in the following code snippet:

final String[] array1 = new String[] { "String 1", "String 2" } ;
final String[] array2 = new String[] { "String 1", "String 2" } ;
map.put( array1, 100 );
System.out.println( map.get( array2 ) );

Another useful case is replacing java.util.IdentityHashMap. For object to primitive type mappings we can use TObject&lt;primitive type&gt;CustomHashMap (for example, TObjectIntCustomHashMap), for object to object mappings – TCustomHashMap. We could implement the required hashing strategy manually:

final TObjectIntCustomHashMap<String> identityMap1 = new TObjectIntCustomHashMap<String>( new HashingStrategy<String>() {
    @Override
    public int computeHashCode( String str ) {
        return System.identityHashCode( str );
    }
 
    @Override
    public boolean equals( String str1, String str2 ) {
        return str1 == str2;
    }
});

But it has already been implemented by the library developers:

final TObjectIntCustomHashMap<String> identityMap2 = new TObjectIntCustomHashMap<String>( new IdentityHashingStrategy<String>() );

IdentityHashMap uses its own data structure to store data – keys and values are interleaved in the same array, which is a CPU cache-friendly solution. It also doesn’t keep redundant Map.Entry objects, so it consumes less memory than java.util.HashMap. Here is a small test: we will add 10M string-to-int mappings to three maps – IdentityHashMap, HashMap and TObjectIntCustomHashMap. Of course, this is just a memory consumption test; a HashMap cannot substitute for an IdentityHashMap.

for ( int i = 0; i < 10000000; ++i )
{
    map1.put( "string #" + i, i );
}
JDK HashMap&lt;String, Integer&gt;: 1050M
JDK IdentityHashMap&lt;String, Integer&gt;: 1293M
Trove TObjectIntCustomHashMap&lt;String&gt;: 893M

Summary

The main reason to use Trove collections is reduced memory consumption. If there is a large array list/set/map whose keys or values could be a primitive type, it is worth replacing it with a Trove collection. Maps from a primitive type to a primitive type are especially worth replacing.

Trove maps and sets support custom hashing strategies, which allow you to implement map/set-specific equals and hashCode logic, for example to implement an identity set or map.

Trove collections implement several additional methods, like grep, retainEntries or adjustOrPutValue, which reduce the code required for many common tasks.

See also

Java collections overview – an overview of all standard JDK collections.

Map.containsKey/Set.contains – improvements described in that chapter are applicable to Trove sets and maps.

Memory consumption of popular Java data types – part 2: memory consumption of HashMap / HashSet, LinkedHashMap / LinkedHashSet, TreeMap / TreeSet and PriorityQueue JDK classes in Java 7 as well as their Trove replacements.

