Bit sets

by Mikhail Vorontsov

From time to time you may need to map an integer key to one or a few flags (boolean values). Bit sets are the best data structure in such cases.

There is only one bit set implementation in the core Java: java.util.BitSet. Internally it is using a long[], each bit of which maps to one integer: bit 0 in the first long is mapped to key 0, bit 1 – to key 1 and so on. This means that if you want to map keys starting from some high number, like 100,000,000 to booleans, then an unchanged BitSet may not be a best choice, because it will allocate all bits for keys from 0 to highest bit you’ve ever set in your bit set. For a bit set with a single set bit at position 100,000,000 it means allocation of about 12.5M memory.

Problem of high base key can be avoided be adding and subtracting this base key on all bit set operations (but, of course, you’ll need to know this base key in advance). The same should be done if you need to map negative values to booleans: java.util.BitSet does not support negative keys.

The third problem to avoid is mapping of very large values, which are outside of int values range. Ordinary java.util.BitSet supports only int keys.

And the last, but not the least is partitioning problem – if you need to store a few values here and a few values there – you still need to allocate all the memory to store keys from 0 to a highest set bit.


LongBitSet to deal with BitSet problems

Let’s try to deal with these problems. We will implement a LongBitSet – a class similar to java.util.BitSet. The only difference is that its keys will be long instead of int.

A LongBitSet will be backed by Map<Long, BitSet>. The idea is to use high bits of a key as a key to this map and low bits as a key to a value BitSet. Thus we will support both partitioning and long keys. Problem is solved.

What about size of each BitSet? It depends on your data, but keeping up to a million keys in each BitSet is usually a good choice. Each bit set will require no more than 128K, keeping total size of a LongBitSet rather small.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
public class LongBitSet
{
    /** Number of bits allocated to a value in an index */
    private static final int VALUE_BITS = 20; //1M values per bit set
    /** Mask for extracting values */
    private static final long VALUE_MASK = ( 1 << VALUE_BITS ) - 1;
 
    /**
     * Map from a value stored in high bits of a long index to a bit set mapped to the lower bits of an index.
     * Bit sets size should be balanced - not to long (otherwise setting a single bit may waste megabytes of memory)
     * but not too short (otherwise this map will get too big). Update value of {@code VALUE_BITS} for your needs.
     * In most cases it is ok to keep 1M - 64M values in a bit set, so each bit set will occupy 128Kb - 8Mb.
     */
    private Map<Long, BitSet> m_sets = new HashMap<Long, BitSet>( 20 );
 
    /**
     * Get set index by long index (extract bits 20-63)
     * @param index Long index
     * @return Index of a bit set in the inner map
     */
    private long getSetIndex( final long index )
    {
        return index >> VALUE_BITS;
    }
 
    /**
     * Get index of a value in a bit set (bits 0-19)
     * @param index Long index
     * @return Index of a value in a bit set
     */
    private int getPos( final long index )
    {
        return (int) (index & VALUE_MASK);
    }
 
    /**
     * Helper method to get (or create, if necessary) a bit set for a given long index
     * @param index Long index
     * @return A bit set for a given index (always not null)
     */
    private BitSet bitSet( final long index )
    {
        final Long iIndex = getSetIndex( index );
        BitSet bitSet = m_sets.get( iIndex );
        if ( bitSet == null )
        {
            bitSet = new BitSet( 1024 );
            m_sets.put( iIndex, bitSet );
        }
        return bitSet;
    }
 
    /**
     * Set a given value for a given index
     * @param index Long index
     * @param value Value to set
     */
    public void set( final long index, final boolean value )
    {
        if ( value )
            bitSet( index ).set( getPos( index ), value );
        else
        {  //if value shall be cleared, check first if given partition exists
            final BitSet bitSet = m_sets.get( getSetIndex( index ) );
            if ( bitSet != null )
                bitSet.clear( getPos( index ) );
        }
    }
 
    /**
     * Get a value for a given index
     * @param index Long index
     * @return Value associated with a given index
     */
    public boolean get( final long index )
    {
        final BitSet bitSet = m_sets.get( getSetIndex( index ) );
        return bitSet != null && bitSet.get( getPos( index ) );
    }
 
    /**
     * Clear all bits between {@code fromIndex} (inclusive) and {@code toIndex} (exclusive)
     * @param fromIndex Start index (inclusive)
     * @param toIndex End index (exclusive)
     */
    public void clear( final long fromIndex, final long toIndex )
    {
        if ( fromIndex >= toIndex ) return;
        final long fromPos = getSetIndex( fromIndex );
        final long toPos = getSetIndex( toIndex );
        //remove all maps in the middle
        for ( long i = fromPos + 1; i < toPos; ++i )
            m_sets.remove( i );
        //clean two corner sets manually
        final BitSet fromSet = m_sets.get( fromPos );
        final BitSet toSet = m_sets.get( toPos );
        ///are both ends in the same subset?
        if ( fromSet != null && fromSet == toSet )
        {
            fromSet.clear( getPos( fromIndex ), getPos( toIndex ) );
            return;
        }
        //clean left subset from left index to the end
        if ( fromSet != null )
            fromSet.clear( getPos( fromIndex ), fromSet.length() );
        //clean right subset from 0 to given index. Note that both checks are independent
        if ( toSet != null )
            toSet.clear( 0, getPos( toIndex ) );
    }
 
    /**
     * Iteration over all set values in a LongBitSet. Order of iteration is not specified.
     * @param proc Procedure to call. If it returns {@code false}, then iteration will stop at once
     */
    public void forEach( final LongProcedure proc )
    {
        for ( final Map.Entry<Long, BitSet> entry : m_sets.entrySet() )
        {
            final BitSet bs = entry.getValue();
            final long baseIndex = entry.getKey() << VALUE_BITS;
            for ( int i = bs.nextSetBit( 0 ); i >= 0; i = bs.nextSetBit( i + 1 ) ) {
                if ( !proc.forEntry( baseIndex + i ) )
                    return;
            }
        }
    }
}
public class LongBitSet
{
    /** Number of bits allocated to a value in an index */
    private static final int VALUE_BITS = 20; //1M values per bit set
    /** Mask for extracting values */
    private static final long VALUE_MASK = ( 1 << VALUE_BITS ) - 1;

    /**
     * Map from a value stored in high bits of a long index to a bit set mapped to the lower bits of an index.
     * Bit sets size should be balanced - not to long (otherwise setting a single bit may waste megabytes of memory)
     * but not too short (otherwise this map will get too big). Update value of {@code VALUE_BITS} for your needs.
     * In most cases it is ok to keep 1M - 64M values in a bit set, so each bit set will occupy 128Kb - 8Mb.
     */
    private Map<Long, BitSet> m_sets = new HashMap<Long, BitSet>( 20 );

    /**
     * Get set index by long index (extract bits 20-63)
     * @param index Long index
     * @return Index of a bit set in the inner map
     */
    private long getSetIndex( final long index )
    {
        return index >> VALUE_BITS;
    }

    /**
     * Get index of a value in a bit set (bits 0-19)
     * @param index Long index
     * @return Index of a value in a bit set
     */
    private int getPos( final long index )
    {
        return (int) (index & VALUE_MASK);
    }

    /**
     * Helper method to get (or create, if necessary) a bit set for a given long index
     * @param index Long index
     * @return A bit set for a given index (always not null)
     */
    private BitSet bitSet( final long index )
    {
        final Long iIndex = getSetIndex( index );
        BitSet bitSet = m_sets.get( iIndex );
        if ( bitSet == null )
        {
            bitSet = new BitSet( 1024 );
            m_sets.put( iIndex, bitSet );
        }
        return bitSet;
    }

    /**
     * Set a given value for a given index
     * @param index Long index
     * @param value Value to set
     */
    public void set( final long index, final boolean value )
    {
        if ( value )
            bitSet( index ).set( getPos( index ), value );
        else
        {  //if value shall be cleared, check first if given partition exists
            final BitSet bitSet = m_sets.get( getSetIndex( index ) );
            if ( bitSet != null )
                bitSet.clear( getPos( index ) );
        }
    }

    /**
     * Get a value for a given index
     * @param index Long index
     * @return Value associated with a given index
     */
    public boolean get( final long index )
    {
        final BitSet bitSet = m_sets.get( getSetIndex( index ) );
        return bitSet != null && bitSet.get( getPos( index ) );
    }

    /**
     * Clear all bits between {@code fromIndex} (inclusive) and {@code toIndex} (exclusive)
     * @param fromIndex Start index (inclusive)
     * @param toIndex End index (exclusive)
     */
    public void clear( final long fromIndex, final long toIndex )
    {
        if ( fromIndex >= toIndex ) return;
        final long fromPos = getSetIndex( fromIndex );
        final long toPos = getSetIndex( toIndex );
        //remove all maps in the middle
        for ( long i = fromPos + 1; i < toPos; ++i )
            m_sets.remove( i );
        //clean two corner sets manually
        final BitSet fromSet = m_sets.get( fromPos );
        final BitSet toSet = m_sets.get( toPos );
        ///are both ends in the same subset?
        if ( fromSet != null && fromSet == toSet )
        {
            fromSet.clear( getPos( fromIndex ), getPos( toIndex ) );
            return;
        }
        //clean left subset from left index to the end
        if ( fromSet != null )
            fromSet.clear( getPos( fromIndex ), fromSet.length() );
        //clean right subset from 0 to given index. Note that both checks are independent
        if ( toSet != null )
            toSet.clear( 0, getPos( toIndex ) );
    }

    /**
     * Iteration over all set values in a LongBitSet. Order of iteration is not specified.
     * @param proc Procedure to call. If it returns {@code false}, then iteration will stop at once
     */
    public void forEach( final LongProcedure proc )
    {
        for ( final Map.Entry<Long, BitSet> entry : m_sets.entrySet() )
        {
            final BitSet bs = entry.getValue();
            final long baseIndex = entry.getKey() << VALUE_BITS;
            for ( int i = bs.nextSetBit( 0 ); i >= 0; i = bs.nextSetBit( i + 1 ) ) {
                if ( !proc.forEntry( baseIndex + i ) )
                    return;
            }
        }
    }
}

Iteration of set bits in a java.util.BitSet is usually done so:

1
2
3
for ( int i = bs.nextSetBit( 0 ); i >= 0; i = bs.nextSetBit( i + 1 ) ) {
    //i is a key which bit was set
}
for ( int i = bs.nextSetBit( 0 ); i >= 0; i = bs.nextSetBit( i + 1 ) ) {
    //i is a key which bit was set
}

In case of LongBitSet it is easier to implement visitor-like approach:

1
2
3
4
5
6
7
8
9
10
public void printAllSetBits()
{
    forEach( new LongProcedure() {
        @Override
        public boolean forEntry( final long value ) {
            System.out.println( value );
            return true;
        }
    });
}
public void printAllSetBits()
{
    forEach( new LongProcedure() {
        @Override
        public boolean forEntry( final long value ) {
            System.out.println( value );
            return true;
        }
    });
}

LongBitSet use cases

The simplest case is, of course, storing one flag per integer key. With LongBitSet you can keep a mapping from few dense values here and another few there to a single flag, which is better from a memory consumption point of view than an ordinary BitSet.

Actually, there is another extremely similar case – check your code for Set<Short/Integer/Long>. They are logically identical to bit sets: a value is either present or absent from a set, so we have the same mapping from an integer to boolean here. A table telling how much memory is occupied by a map of 10M integers is copied here from Trove chapter:

JDK HashSet<Integer> Trove THashSet<Integer> Trove TIntSet
525M 236M 103M

A bit set of the same size will occupy 1.25Mb. From this we can conclude that HashSet<Integer> should be replaced with a bit set if it has one set bit in less than 400 bits. Even for most memory optimal TIntSet the ratio is still about 80 to 1. This means that in most cases it worth replacing a set of integer numbers with a bit set.

The last case is storing several flags (less than 8) per integer key. You need several separate bit sets to keep several flags per key. If you need to store a several bits value (when all bits are parts of a single value), it is better to use an array or a map. In that case it worth looking at Use case: how to compact a long-to-long mapping chapter for some ideas.

See also

Java collections overview – an overview of all standard JDK collections.

Summary

Do not forget about bit sets when you need to map a large number of integer keys to boolean flags.

Sets of integer values should be replaced with bit sets in a lot of cases in order to save a lot of memory.


3 thoughts on “Bit sets

  1. SV

    I was looking for a way to create long indexed bitset, spot on. Thank you.
    One bug though, the method getPos() is doing ANDing, shouldn’t it be doing a modulus?


    private int getPos( final long index )
    {
    return (int) (index % VALUE_MASK);
    }

    Reply
    1. admin Post author

      Yes, your version is correct, and, assuming that JRE will replace division by power of 2 with a more efficient operation, is even fast, but I intended to define MASK_VALUE as 2^N – 1:
      private static final long VALUE_MASK = ( 1 < < VALUE_BITS ) - 1;

      Reply
  2. Pingback: Trove library: using primitive collections for performance | Java Performance Tuning Guide

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code lang=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre lang="" extra="">