String.intern in Java 7 and 8 – part 3

by Mikhail Vorontsov

I want to return to String.intern method I have discussed earlier ( part 1, part 2 ). During the last couple of months I have used interning heavily in my pet project in order to see the pros and cons of using String.intern for every non-temporary string in my application (non-temporary – the one which is expected to live longer than for a few seconds and most likely get into the GC old generation).

As I have written before, the advantages of String.intern in Java 7 and 8 are:

  • Rather fast execution, nearly no performance loss in the multithreaded mode (still using the global string pool).
  • Saving memory, thus allowing your dataset to shrink, which will let your application to run faster (in general).

The main disadvantages of this method (as I have noticed before) are:

  • The requirement to set -XX:StringTableSize=N JVM parameter in advance and dealing with its fixed value (needs JVM restart in order to expand the JVM string pool).
  • Adding calls to String.intern in a lot of places globally (probably via your own wrapper) – which has linear code maintenance cost.

After a couple of months of using String.intern in my project, I think that its usage should be limited to the fields having a limited domain of distinct values (like person first names or state/province names). We should not try to intern anything with a low probability of a repeating value – it would be a waste of CPU time.

For example, suppose you are writing some personal data management tool for government (so you have a lot of non-empty fields compared, to, say, social networks registrations).

Supposing you have to keep all data in memory, it definitely makes sense to intern:

  • Person first name – even in multinational countries, like Australia, number of large national groups is rather low, which keeps the total number of first names in use under a few thousands with less than a thousand of popular ones.
  • Person surname – better in China, a bit worse in other countries, but the likelihood of repetitions is still good enough.
  • House/apartment number – in most countries they may include letters, but are generally based on numbers increasing from 1, which means a low number of distinct values.
  • Street name (without street type, like ‘road’/’avenue’/’street’) – there are not too many of them.
  • Suburb/town name – even less of those
  • State/region/province – just a few of these

On the other hand, if you can not afford to split all data into such small pieces, it may be better not to intern it at all. For example, the full street address, like ‘100 King st’ is much more unique than ‘100’ or ‘King’.

Let’s compare adding strings to the ordinary JDK HashMap with adding the interned strings to the same map. It will more or less highlight the cost of interning the unique strings. I will use my workstation with Intel Xeon E5-2650 CPU (8 physical, 16 virtual cores @ 2 Ghz) and 128 Gb RAM for this test. You will need to set -Xmx and -Xms parameters to the same value in order to minimize the garbage collection.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
private static void testInsertVsIntern()
{
    //in order to compile these methods
    testMapInsertion( 100 * 1000 );
    testMapInsertionIntern( 100 * 1000 );
    System.gc();
 
    System.out.println( "Now real run" );
 
    testMapInsertion( 50 * 1000 * 1000 + 100 );
    System.gc();
    testMapInsertionIntern( 50 * 1000 * 1000 + 100 );
}
 
private static void testMapInsertion( final int cnt )
{
    final Map<Integer, String> map = new HashMap<Integer, String>( cnt );
    long start = System.currentTimeMillis();
    for ( int i = 0; i < cnt; ++i )
    {
        final String str = Integer.toString( i );
        map.put( i, str );
        if ( i % 1000000 == 0 ) //1M
        {
            System.out.println( i + "; time (insert) = " + ( System.currentTimeMillis() - start ) / 1000.0 + " sec" );
            start = System.currentTimeMillis();
        }
    }
    System.out.println( "Total length = " + map.size() );
}
 
private static void testMapInsertionIntern( final int cnt )
{
    final Map<Integer, String> map = new HashMap<Integer, String>( cnt );
    long start = System.currentTimeMillis();
    for ( int i = 0; i < cnt; ++i )
    {
        final String str = Integer.toString( i );
        map.put( i, str.intern() ); //here is the difference!
        if ( i % 1000000 == 0 ) //1M
        {
            System.out.println( i + "; time (intern) = " + ( System.currentTimeMillis() - start ) / 1000.0 + " sec" );
            start = System.currentTimeMillis();
        }
    }
    System.out.println( "Total length = " + map.size() );
}
private static void testInsertVsIntern()
{
    //in order to compile these methods
    testMapInsertion( 100 * 1000 );
    testMapInsertionIntern( 100 * 1000 );
    System.gc();

    System.out.println( "Now real run" );

    testMapInsertion( 50 * 1000 * 1000 + 100 );
    System.gc();
    testMapInsertionIntern( 50 * 1000 * 1000 + 100 );
}

private static void testMapInsertion( final int cnt )
{
    final Map<Integer, String> map = new HashMap<Integer, String>( cnt );
    long start = System.currentTimeMillis();
    for ( int i = 0; i < cnt; ++i )
    {
        final String str = Integer.toString( i );
        map.put( i, str );
        if ( i % 1000000 == 0 ) //1M
        {
            System.out.println( i + "; time (insert) = " + ( System.currentTimeMillis() - start ) / 1000.0 + " sec" );
            start = System.currentTimeMillis();
        }
    }
    System.out.println( "Total length = " + map.size() );
}

private static void testMapInsertionIntern( final int cnt )
{
    final Map<Integer, String> map = new HashMap<Integer, String>( cnt );
    long start = System.currentTimeMillis();
    for ( int i = 0; i < cnt; ++i )
    {
        final String str = Integer.toString( i );
        map.put( i, str.intern() ); //here is the difference!
        if ( i % 1000000 == 0 ) //1M
        {
            System.out.println( i + "; time (intern) = " + ( System.currentTimeMillis() - start ) / 1000.0 + " sec" );
            start = System.currentTimeMillis();
        }
    }
    System.out.println( "Total length = " + map.size() );
}

As you can see, the only difference between 2 test methods is String.intern() call in the testMapInsertionIntern method. Otherwise both methods are identical.

The first test simply adds Integer + String pairs to the map. It took 0.065-0.07 sec for 1 million pairs during the whole test run (this time also includes int -> String conversions), which means a steady speed of approximately 16M pairs/second.

I have used -XX:StringTableSize=1000003 (one million and three) as a JVM string pool size. I got the following results (there was only one minor garbage collection during the test):

1000000; time (intern) = 0.231 sec
2000000; time (intern) = 0.251 sec
3000000; time (intern) = 0.268 sec
4000000; time (intern) = 0.285 sec
5000000; time (intern) = 0.311 sec
6000000; time (intern) = 0.333 sec
7000000; time (intern) = 0.369 sec
8000000; time (intern) = 0.399 sec
9000000; time (intern) = 0.444 sec
10000000; time (intern) = 0.507 sec
11000000; time (intern) = 0.532 sec
12000000; time (intern) = 0.614 sec
13000000; time (intern) = 0.686 sec
14000000; time (intern) = 0.797 sec
15000000; time (intern) = 0.837 sec
16000000; time (intern) = 0.902 sec
17000000; time (intern) = 0.962 sec
18000000; time (intern) = 1.019 sec
19000000; time (intern) = 1.083 sec
20000000; time (intern) = 1.121 sec
21000000; time (intern) = 1.204 sec
22000000; time (intern) = 1.226 sec
23000000; time (intern) = 1.292 sec
24000000; time (intern) = 1.312 sec
25000000; time (intern) = 1.379 sec
26000000; time (intern) = 1.444 sec
27000000; time (intern) = 1.491 sec
28000000; time (intern) = 1.542 sec
29000000; time (intern) = 1.569 sec
30000000; time (intern) = 1.732 sec
31000000; time (intern) = 1.74 sec
32000000; time (intern) = 1.735 sec
33000000; time (intern) = 1.842 sec
34000000; time (intern) = 1.893 sec
35000000; time (intern) = 1.989 sec
36000000; time (intern) = 1.971 sec
37000000; time (intern) = 2.033 sec
38000000; time (intern) = 2.139 sec
[GC 4195274K->4207538K(16078208K), 5.2907230 secs]
39000000; time (intern) = 7.46 sec
40000000; time (intern) = 2.259 sec
41000000; time (intern) = 2.28 sec
42000000; time (intern) = 2.346 sec
43000000; time (intern) = 2.394 sec
44000000; time (intern) = 2.414 sec
45000000; time (intern) = 2.492 sec
46000000; time (intern) = 2.536 sec
47000000; time (intern) = 2.619 sec
48000000; time (intern) = 2.654 sec
49000000; time (intern) = 2.673 sec
50000000; time (intern) = 2.775 sec

As you can see, it takes 3.5 times longer to process the first million of strings and more for subsequent strings. Going back to persons/addresses example, it means 3.5-4 times longer processing for full street addresses with nearly no benefit (most of such addresses will be unique).

See also

String.intern in Java 6, 7 and 8 - string pooling article describing the implementation and benefits of using String.intern() method in Java 7 and 8.

String.intern in Java 6, 7 and 8 - multithreaded access article describing the performance characteristics of the multithreaded access to String.intern().

Summary

  • Despite serious optimizations done in the String.intern() implementation in Java 7+, it still takes a noticeable time to run (noticeable for CPU sensitive applications). The simple example in this article runs 3.5 times faster without calls to String.intern(). You should not use String.intern() as a safety net, passing every long living string into it. Instead process only fields with a limited number of possible distinct values (for example, states/provinces if processing addresses) - memory savings in this situation will definitely pay off the initial CPU costs of String.intern().