* Fastest among int-to-int map implementations I have tested in my previous article in the tests I have implemented for that article.
I would like to thank Sebastiano Vigna and Roman Leventov for sharing their hash map wisdom with me. Some implementation ideas were inspired by “Code Optimization: Effective Memory Usage” by Kris Kaspersky.
This article will give you a step by step overview of various implementation tricks used in the modern hash map implementations. At the end of this article you will have a probably fastest Java int-to-int hash map implementation available at the moment of writing of this article.
Most of modern hash maps are based on the idea of open indexing. What does it mean? Your map is based on the array of keys (values will always be placed at the matching array index, so forget about them for now). You have to find your key in the array of keys for each map operation. How does it implemented?
First of all, you need the initial lookup position in the array. It may be calculated by any function which maps a key into an integer in the range
[0, array.length - 1]. A key is usually mapped into an integer by means of
hashCode method. A simplest function here could be
Math.abs(key.hashCode() % array.length) (keep in mind that
% result could be negative).
As you understand, a mapping of large set of keys into a small set of integer values means that you may end up with some collisions (they are called hash collisions) – same results of the initial function for the different keys. Collisions are resolved by trying to apply another function to the original array index. The simplest of such functions is
(prevIdx + 1) % array.length. There is one requirement for such functions – if applied in a loop, they should cover the whole set or array cells, so that you can use the whole array capacity. Another example of such function is incrementing the index by one prime number if the array length is another prime number.
Free and removed cells
In theory, that’s enough to implement your own hash map. In practice, you need to distinguish free and removed cells from occupied cells (you can avoid using removed cells if you’ll do extra work in
remove method – see how it is implemented in the latest FastUtil). Removed cells are also known as “tombstones”.
Your keys array is initially filled with free “cells”. You set a cell into “removed” state if you need to remove an existing key.
Let’s take a look at an example:
int key map uses the initial and next functions defined above:
This map originally contained keys 1, 2, 3 and 4, but key=3 was subsequently removed from a map, so it was replaced with a removed (“R”) placeholder.
Let’s see what should we do to find the following keys:
|2||Start function points at a cell with index=2 at once. We have key=2 at a cell with index=2, so no further lookup is required.|
|3||Start function points at a cell with index=3. This cell is “removed”, so we have to apply “nextIdx” function in a loop until we either find a key or a free cell. We check cell index=4 next – bad luck, key is not equal. Then we check cell index=5: it is a free cell, so we stop the lookup – key is not found.|
Next, let’s see what will happen if we want to add key=10:
initial = key % array.length = 10 % 9 = 1. Cell at index=1 is already occupied with another key, so we can not use it. So is cell at index=2. The cell at index=3 is “removed”, so we can reuse it and put key=10 into it.
Removed cells cleanup
In many cases your hash map may degrade to O(n2) complexity if you would keep the removed cells in the map. Fastest maps are implementing the removed cells cleanup one way or another. As a result, all other map methods will need to distinguish just 2 cell states: free or used. Besides that,
remove method is usually called infrequently compared to
get and less frequently than
put, which means that some extra complexity during key removal will be paid off by fasted execution of other methods. This article will use FastUtil cleanup logic.
The initial index function I have mentioned above (
initial = Math.abs( key % array.length ); ) will put consecutive keys in the consecutive array cells. This is highly undesirable if your next cell function is simply picking up the next array cell, because it will cause the long lookup chains to be created in a pretty common case.
In order to avoid it, we need to “scramble” the key, shuffling its bits. I will rely on FastUtil scrambling code:
As a result, consecutive keys will not be placed in the consecutive array cells, thus keeping the average hash chain length under control. As for “random” keys case, you are likely to end up with a pretty good distribution of keys over the keys array as well.
Now you are definitely ready to implement your own hash map. We will be implementing an
int-int map in the next several sections of this article.