Changes to String internal representation made in Java 1.7.0_06

by Mikhail Vorontsov

This post was updated on 19 Nov 2013 in order to reflect Java 8 changes.


This post was updated on 28 Nov 2013 in order to reflect Java 7u40 changes (thanks to Sunny Chan and his colleagues for pointing my attention at this JDK update).

Sharing an underlying char[]

An original String implementation has 4 non static field: char[] value with string characters, int offset and int count with an index of the first character to use from value and a number of characters to use and int hash with a cached value of a String hash code. As you can see, in a very large number of cases a String will have offset = 0 and count = value.length. The only exception to this rule were the strings created via String.substring calls and all API calls using this method internally (like Pattern.split).

String.substring created a String, which shared an internal char[] value with an original String, which allowed you:

  1. To save some memory by sharing character data
  2. To run String.substring in a constant time ( O(1) )

At the same time such feature was a source of a possible memory leak: if you extract a tiny substring from an original huge string and discard that original string, you will still have a live reference to the underlying huge char[] value taken from an original String. The only way to avoid it was to call a new String( String ) constructor on such string – it made a copy of a required section of underlying char[], thus unlinking your shorter string from its longer “parent”.

From Java 1.7.0_06 (as well as in current versions of Java 8 – Nov 13) offset and count fields were removed from a String. This means that you can’t share a part of an underlying char[] value anymore. Now you can forget about a memory leak described above and never ever use new String(String) constructor anymore. As a drawback, you now have to remember that String.substring has now a linear complexity instead of a constant one.



Changes to hashing logic

The rest of this article applies to Java 7u6+ only. This code was removed in Java 8.

There is another change introduced to String class in the same update: a new hashing algorithm. Oracle suggests that a new algorithm gives a better distribution of hash codes, which should improve performance of several hash-based collections: HashMap, Hashtable, HashSet, LinkedHashMap, LinkedHashSet, WeakHashMap and ConcurrentHashMap. Unlike changes from the first part of this article, these changes are experimental and turned off by default.

As you may guess, these changes are only for String keys. If you want to turn them on, you’ll have to set a jdk.map.althashing.threshold system property to a non-negative value (it is equal to -1 by default). This value will be a collection size threshold, after which a new hashing method will be used. A small remark here: hashing method will be changed on rehashing only (when there is no more free space). So, if a collection was rehashed last time at size = 160 and jdk.map.althashing.threshold = 200, then a method will only be changed when your collection will grow to size of 320 (approximately).

String now has a hash32() method, which result is cached in int hash32 field. The biggest difference of this method is that the result of hash32() on the same string may be different on various JVM runs (actually, it will be different in most cases, because it uses a single System.currentTimeMillis() and two System.nanoTime calls for seed initialization). As a result, iteration order on some of your collections will be different each time you run your program.

Actually, I was a little surprised by this method. Why do we need it if an original hashCode method works very good? I decided to try a test program from hashCode method performance tuning article in order to find out how many duplicate hash codes we will have with a hash32 method.

String.hash32() method is not public, so I had to take a look at a HashMap implementation in order to find out how to call this method. The answer is sun.misc.Hashing.stringHash32(String).

The same test on 1 million distinct keys has shown 304 duplicate hash values, compared to no duplicates while using String.hashCode. I think, we need to wait for further improvements or use case descriptions from Oracle.

New hashing may severely affect highly multithreaded code (Java 7u6 to Java 7u40 (exclusive))


This section applies to Java 7 versions between build 6 (inclusive) and build 40 (exclusive). This code was removed in Java 8. See the next section for Java 7u40+ information.

Oracle has made a bug in the implementation of hashing in the following classes: HashMap, Hashtable, HashSet, LinkedHashMap, LinkedHashSet and WeakHashMap. Only ConcurrentHashMap is not affected. The problem is that all non-concurrent classes now have the following field:

1
2
3
4
5
/**
 * A randomizing value associated with this instance that is applied to
 * hash code of keys to make hash collisions harder to find.
 */
transient final int hashSeed = sun.misc.Hashing.randomHashSeed(this);
/**
 * A randomizing value associated with this instance that is applied to
 * hash code of keys to make hash collisions harder to find.
 */
transient final int hashSeed = sun.misc.Hashing.randomHashSeed(this);

This means that for every created map/set instance sun.misc.Hashing.randomHashSeed method will be called. randomHashSeed, in turn, calls java.util.Random.nextInt method. Random class is well known for its multithreaded unfriendliness: it contains private final AtomicLong seed field. Atomics behave well under low to medium contention, but work extremely bad under heavy contention.

As a result, many highly loaded web applications processing HTTP/JSON/XML requests may be affected by this bug, because nearly all known parsers use one of the affected classes for “name-value” representation. All these format parsers may create nested maps, which further increases the number of maps created per second.

How to solve this problem?

1. ConcurrentHashMap way: call randomHashSeed method only when jdk.map.althashing.threshold system property was defined. Unfortunately, it is available only for JDK core developers.

1
2
3
4
5
6
7
8
9
10
11
12
13
/**
 * A randomizing value associated with this instance that is applied to
 * hash code of keys to make hash collisions harder to find.
 */
private transient final int hashSeed = randomHashSeed(this);
 
private static int randomHashSeed(ConcurrentHashMap instance) {
    if (sun.misc.VM.isBooted() && Holder.ALTERNATIVE_HASHING) {
        return sun.misc.Hashing.randomHashSeed(instance);
    }
 
    return 0;
}
/**
 * A randomizing value associated with this instance that is applied to
 * hash code of keys to make hash collisions harder to find.
 */
private transient final int hashSeed = randomHashSeed(this);

private static int randomHashSeed(ConcurrentHashMap instance) {
    if (sun.misc.VM.isBooted() && Holder.ALTERNATIVE_HASHING) {
        return sun.misc.Hashing.randomHashSeed(instance);
    }

    return 0;
}

2. Hacker way: fix sun.misc.Hashing class. Highly not recommended. If you still wish to go ahead, here is an idea: java.util.Random class is not final. You can use java.util.concurrent.ThreadLocalRandom added to Java 7 – it is a thread local subclass of Random using ThreadLocal<ThreadLocalRandom> (thanks to Benjamin Possolo for pointing my attention to not mentioning this class in the original article). Besides being just thread local, ThreadLocalRandom is CPU cache-aware: it adds 64 byte padding (cache line size) after seed in the every ThreadLocalRandom instance (in order to eliminate chances of 2 different seeds ending up in the single cache line).

Then you will have to update sun.misc.Hashing.Holder.SEED_MAKER field – set it to your extended Random class instance. Don’t worry that this field is private, static and final – reflection can help you:

1
2
3
public class Hashing {
    private static class Holder {
        static final java.util.Random SEED_MAKER;
public class Hashing {
    private static class Holder {
        static final java.util.Random SEED_MAKER;

New hashing does not affect highly multithreaded code performance anymore – Java 7u40+

Oracle has fixed the above mentioned problem in Java 7 update 40. Take a look at this Java bug for more information.

They have used way 1 from the previous section – calling sun.misc.Hashing.randomHashSeed(this) only if alternative hashing is turned on. Oracle fixed just 2 classes: HashMap and Hashtable, but it has indirectly fixed HashSet, LinkedHashMap and LinkedHashSet, because these 3 classes are built on top of HashMap. The only not fixed class is WeakHashMap, but I can not imagine a use case for actively creating instances of this class.

See also

This article has caused a large discussion on Reddit recently. I would like to advice readers of this article to take a look at reddit user bondolo comment regarding the reasons behind these changes.

String.intern in Java 6, 7 and 8 – string pooling

Summary

  • From Java 1.7.0_06 String.substring always creates a new underlying char[] value for every String it creates. This means that this method now has a linear complexity compared to previous constant complexity. The advantage of this change is a slightly smaller memory footprint of a String (8 bytes less than before) and a guarantee to avoid memory leaks caused by String.substring (see String packing part 1: converting characters to bytes for more details on Java object memory layout).
  • Java 7u6+ functionality. Removed in Java 8. Starting from the same Java update, String class got a second hashing method called hash32. This method is currently not public and could be accessed without reflection only via sun.misc.Hashing.stringHash32(String) call. This method is used by 7 JDK hash-based collections if their size will exceed jdk.map.althashing.threshold system property. This is an experimental function and currently I don’t recommend using it in your code.
  • Java 7u6 (inclusive) to Java 7u40 (exclusive) functionality. Not applicable to Java 8. All standard JDK non-concurrent maps and sets in all Java versions between Java 7u6 (inclusive) and Java 7u40 (exclusive) are affected by a performance bug caused by new hashing implementation. This bug affects only multithreaded applications creating heaps of maps per second. See this article for more details. This problem was fixed in Java 7u40.

One thought on “Changes to String internal representation made in Java 1.7.0_06

  1. Pingback: 基本类型转String - ImportNew

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code lang=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" extra="">