hashCode method performance tuning

by Mikhail Vorontsov

In this chapter we will discuss various implications of hashCode method implementation on application performance.

The main purpose of the hashCode method is to allow an object to be used as a key in a hash map or as a member of a hash set. In this case the object should also implement the equals(Object) method consistently with its hashCode implementation:

  • If a.equals(b) then a.hashCode() == b.hashCode()
  • If hashCode() is called twice on the same object, it must return the same result, provided that the object was not modified between the calls

hashCode from a performance point of view

From a performance point of view, the main objective of your hashCode implementation is to minimize the number of objects sharing the same hash code. All JDK hash-based collections store their values in an array. The hash code is used to calculate the initial lookup position in this array; after that, equals is used to compare the given value with the values stored there. So, if all values have distinct hash codes, the possibility of hash collisions is minimized. On the other hand, if all values share the same hash code, the hash map (or set) degrades into a list, with operations on it having O(n²) complexity.

For more details, read about collision resolution in hash maps. JDK hash maps use the method called “chaining” – all key-value pairs whose hash codes map to the same bucket are stored in a linked list (since Java 8, large buckets are converted into balanced trees). The alternative method, open addressing, stores colliding entries in other slots of the same array; in the JDK it is used, for example, by java.util.IdentityHashMap.
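The chained lookup described above can be sketched as follows. This is a simplified illustration, not the actual java.util.HashMap code (which also resizes, spreads hash bits and converts long chains to trees); the ChainedMap class and its method names are made up:

```java
import java.util.LinkedList;

// Simplified illustration of chaining: each bucket holds a linked list of
// entries whose hash codes map to the same array index.
public class ChainedMap<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(final K key, final V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedMap(final int capacity) {
        buckets = new LinkedList[capacity];
    }

    private int index(final K key) {
        // the hash code selects the bucket; clear the sign bit before the modulo
        return (key.hashCode() & 0x7FFFFFFF) % buckets.length;
    }

    public void put(final K key, final V value) {
        final int i = index(key);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        for (final Entry<K, V> e : buckets[i])
            if (e.key.equals(key)) { e.value = value; return; } // replace existing
        buckets[i].add(new Entry<>(key, value));
    }

    public V get(final K key) {
        final int i = index(key);
        if (buckets[i] == null) return null;
        for (final Entry<K, V> e : buckets[i]) // equals resolves collisions in the chain
            if (e.key.equals(key)) return e.value;
        return null;
    }
}
```

The shorter each chain, the faster put and get run – which is exactly why hash code quality matters.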

Let’s see the difference in hash code quality. We will compare a normal String with a string wrapper which overrides the hashCode method to return the same hash code for all objects.

private static class SlowString
{
    public final String m_str;
 
    public SlowString( final String str ) {
        this.m_str = str;
    }
 
    @Override
    public int hashCode() {
        return 37;
    }
 
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        final SlowString that = ( SlowString ) o;
        return !(m_str != null ? !m_str.equals(that.m_str) : that.m_str != null);
    }
}

Here is the testing method. It is worth quoting because we will use it again later. It accepts a prepared list of objects (so that object creation time is not included in the measurement) and calls Map.put followed by Map.containsKey on each value in the list.

private static void testMapSpeed( final List<?> lst, final String name )
{
    final Map<Object, Object> map = new HashMap<Object, Object>( lst.size() );
    int cnt = 0;
    final long start = System.currentTimeMillis();
    for ( final Object obj : lst )
    {
        map.put( obj, obj );
        if ( map.containsKey( obj ) )
            ++cnt;
    }
    final long time = System.currentTimeMillis() - start;
    System.out.println( "Time for "  + name + " is " + time / 1000.0 + " sec, cnt = " + cnt );
}

Both String and SlowString objects are created in a loop as "ABCD" + i. It took 0.041 sec to process 100,000 String objects, while the same number of SlowString objects took 82.5 seconds.

As it turned out, the String class has an exceptionally high quality hashCode method. Let’s write another test. We will create a list of Strings. The first half of them will be equal to "ABCdef*&" + i, the second half to "ABCdef*&" + i + "ghi" (to ensure that changes in the middle of a string with a constant tail do not decrease hash code quality). We will create 1M, 5M, 10M and 20M strings and see how many of them share hash codes and how many strings share the same hash code. This is the test output:

Number of duplicate hashCodes for 1000000 strings = 0

Number of duplicate hashCodes for 5000000 strings = 196
Number of hashCode duplicates = 2 count = 196

Number of duplicate hashCodes for 10000000 strings = 1914
Number of hashCode duplicates = 2 count = 1914

Number of duplicate hashCodes for 20000000 strings = 17103
Number of hashCode duplicates = 2 count = 17103

So, as you can see, only a very small number of strings share the same hash code, and it is very unlikely that one hash code will be shared by more than two strings (unless they are specially crafted, of course). Your data may be different, of course – just run a similar test on your typical keys.
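The duplicate-counting test above can be sketched roughly as follows. This is my reconstruction, not the article’s original code; the string generation pattern is the one described above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashDuplicates {
    // Count how many strings share their hash code with at least one other string
    public static int duplicateHashCodes(final List<String> strings) {
        final Map<Integer, Integer> counts = new HashMap<>(strings.size());
        for (final String s : strings)
            counts.merge(s.hashCode(), 1, Integer::sum);
        int dup = 0;
        for (final int c : counts.values())
            if (c > 1) dup += c; // all strings participating in a collision
        return dup;
    }

    public static void main(final String[] args) {
        final int n = 500_000; // half of the 1M total
        final List<String> strings = new ArrayList<>(n * 2);
        for (int i = 0; i < n; ++i) {
            strings.add("ABCdef*&" + i);          // first half
            strings.add("ABCdef*&" + i + "ghi");  // second half, constant tail
        }
        System.out.println("Number of duplicate hashCodes for " + strings.size()
                + " strings = " + duplicateHashCodes(strings));
    }
}
```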

Autogenerated hashCode for long fields

It is worth mentioning how the hashCode method is generated for long fields by most IDEs. Here is a generated hashCode method for a class with two long fields:

public int hashCode() {
    int result = (int) (val1 ^ (val1 >>> 32));
    result = 31 * result + (int) (val2 ^ (val2 >>> 32));
    return result;
}

And here is a similar method generated for a class with two int fields:

public int hashCode() {
    int result = val1;
    result = 31 * result + val2;
    return result;
}

As you see, long fields are treated differently. Similar code is used in java.util.Arrays.hashCode(long[]). You will actually get better hash code distribution if you extract the high and low 32 bits of each long and treat them as separate int values while calculating the hash code. Here is an improved hashCode method for a class with two long fields (note that this method runs slower than the original one, but the quality of the new hash codes will allow hash collections to run faster, even at the expense of the hashCode slowdown).

public int hashCode() {
    int result = (int) val1;
    result = 31 * result + (int) (val1 >>> 32);
    result = 31 * result + (int) val2;
    return 31 * result + (int) (val2 >>> 32);
}

Here are the results of the testMapSpeed method for processing 10M objects of each of the three kinds. They were initialized with the same values (all long values actually fit into int).

  • Two longs with original hashCode: 2.596 sec
  • Two longs with modified hashCode: 1.435 sec
  • Two ints: 0.737 sec

As you can see, the hashCode update makes a difference – not a huge one, but worth noticing in performance-critical code.

Use case: How to benefit from String.hashCode quality

Let’s assume we have a map from string identifiers to some values. The map keys (string identifiers) are not stored anywhere else in memory (at most a few of them may be stored somewhere else at a time). We have already collected all map entries, for example, on the first pass of some two-phase algorithm. On the second phase we need to query map values by keys, and we will query the map only with existing map keys.

How can we improve the map? As you have seen before, String.hashCode returns mostly distinct values. We can scan all keys, calculate their hash codes and find the hash codes that are not unique:

Map<Integer, Integer> cnt = new HashMap<Integer, Integer>( max );
for ( final String s : dict.keySet() )
{
    final int hash = s.hashCode();
    final Integer count = cnt.get( hash );
    if ( count != null )
        cnt.put( hash, count + 1 );
    else
        cnt.put( hash, 1 );
}
 
//keep only not unique hash codes
final Map<Integer, Integer> mult = new HashMap<Integer, Integer>( 100 );
for ( final Map.Entry<Integer, Integer> entry : cnt.entrySet() )
{
    if ( entry.getValue() > 1 )
        mult.put( entry.getKey(), entry.getValue() );
}

Now we can create two maps out of the old one. Let’s assume for simplicity that the old map values were just Objects. In this case we will end up with a Map<Integer, Object> and a Map<String, Object> (for production code, Trove’s TIntObjectHashMap is recommended instead of Map<Integer, Object>). The first map will contain the mapping from unique hash codes to values; the second map – from strings with non-unique hash codes to values.

final Map<Integer, Object> unique = new HashMap<Integer, Object>( 1000 );
final Map<String, Object> not_unique = new HashMap<String, Object>( 1000 );
 
//dict - original map
for ( final Map.Entry<String, Object> entry : dict.entrySet() )
{
    final int hashCode = entry.getKey().hashCode();
    if ( mult.containsKey( hashCode ) )
        not_unique.put( entry.getKey(), entry.getValue() );
    else
        unique.put( hashCode, entry.getValue() );
}

Now, in order to get a value, we need to query the unique map first, and then the not_unique map if the first query has not returned a result:

public Object get( final String key )
{
    final int hashCode = key.hashCode();
    Object value = m_unique.get( hashCode );
    if ( value == null )
        value = m_not_unique.get( key );
    return value;
}

In some rare cases you may still have too many (in your opinion) string keys in the not_unique map. In that case, first try replacing the hash code calculation with either a java.util.zip.CRC32 or a java.util.zip.Adler32 checksum (Adler32 is faster than CRC32, but has somewhat worse distribution). As a last resort, try combining two independent hash functions into one long key: the lower 32 bits for one function, the higher 32 bits for the other. Obvious choices of hash functions are Object.hashCode, java.util.zip.CRC32 and java.util.zip.Adler32.
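The “two independent functions in one long key” idea might look like the following sketch (my illustration; the class and the combinedHash method name are made up):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class CombinedKey {
    // Combine two independent hash functions into one 64-bit key:
    // CRC32 of the UTF-8 bytes in the high 32 bits, String.hashCode in the low 32 bits.
    public static long combinedHash(final String s) {
        final CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() already fits into the lower 32 bits of a long
        return (crc.getValue() << 32) | (s.hashCode() & 0xFFFFFFFFL);
    }
}
```

Two strings now collide only if both functions collide on them at once, which is far less likely than a single 32-bit collision.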

How to compress a set even better than a map

The previous use case discussed how to get rid of keys in a map. We may actually achieve even better results for sets. I can see two cases when such sets are used: the first is splitting an original set into several subsets and querying whether an identifier belongs to a given subset; the second is writing a spellchecker – you will query the set with arbitrary, unforeseeable values (that’s the nature of spellcheckers), but some mistakes are not critical (if a misspelled word happens to share a hash code with a correct word, you will report it as correct). In both cases compressed sets will be extremely useful.

If we just apply the previous logic to sets, we will end up with a Set<Integer> for unique hash codes and a Set<String> for non-unique ones. We can optimize at least a couple of things here.

If we limit the range of values generated by the hashCode method (2^20 is fine, for example), then we can replace the Set<Integer> with a BitSet, as discussed in the Bit sets article. We can always select a sufficient hash code range if we know the size of the original set in advance (before compression).
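A BitSet-based “set of hash codes” might look like this sketch (my illustration, assuming hash codes are reduced to a 2^20 range; the class and method names are made up):

```java
import java.util.BitSet;

public class HashCodeBitSet {
    private static final int RANGE = 1 << 20;      // 2^20 allowed hash code values
    private final BitSet bits = new BitSet(RANGE); // 2^20 bits = 128 KB

    // Reduce an arbitrary hash code into [0, RANGE)
    private static int reduce(final String key) {
        return (key.hashCode() & 0x7FFFFFFF) % RANGE;
    }

    public void add(final String key) {
        bits.set(reduce(key));
    }

    // May return a false positive when two keys' reduced hash codes collide
    public boolean contains(final String key) {
        return bits.get(reduce(key));
    }
}
```

Compare the footprint: 128 KB of bits versus a Set<Integer> holding hundreds of thousands of boxed Integer objects.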

The next objective is to check how many identifiers with non-unique hash codes are still left. Improve your hashCode method or increase the range of allowed hash code values if you have too many non-unique hash codes. In the perfect case all your identifiers will have unique hash codes (which is not too difficult to achieve, by the way). As a result, you will be able to use BitSets instead of large sets of strings.

See also

A new hashing method was added to the String class in Java 1.7.0_06. For more details see the Changes to String internal representation made in Java 1.7.0_06 article.

Summary

Try to improve the distribution of your hashCode method results. This is far more important than optimizing the method’s speed. Never write a hashCode method which returns a constant.

String.hashCode result distribution is nearly perfect, so you can sometimes substitute Strings with their hash codes. If you are working with sets of strings, try to end up with BitSets, as described in this article. Your code’s performance will improve greatly.

