java.util.Random and java.util.concurrent.ThreadLocalRandom in multithreaded environments

by Mikhail Vorontsov

Generating pseudorandom data

There are 2 types of random generators in Java: pseudorandom and secure. Pseudorandom generators transform a seed into new portions of pseudorandom data using a fixed formula. Secure random generators use machine-specific sources of genuinely random events (file/socket access timings, for example) to generate their data.

Secure random generators:

  • should be used only when cryptographically strong random data is required
  • are slow
  • could block (“get stuck”) waiting for external events if you request a lot of random data (Linux /dev/random is an example of such a generator)

Pseudorandom generators, on the other hand, depend only on the initial “seed” value, so you can reproduce the same sequence of pseudorandom events by providing the same seed to the algorithm. In the general case such generators are fast, because they are CPU bound only (they do not rely on any IO). We will review the evolution of pseudorandom generators in Java and the reasons behind these changes.
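As a quick illustration of this determinism (a minimal sketch, not part of the benchmark below; the class and method names are mine), two generators constructed with the same seed produce bit-for-bit identical sequences:

```java
import java.util.Arrays;
import java.util.Random;

public class SeedDemo {
    // Draw n ints from a java.util.Random seeded with the given value
    static int[] draw(long seed, int n) {
        final Random r = new Random(seed);
        final int[] out = new int[n];
        for (int i = 0; i < n; ++i)
            out[i] = r.nextInt();
        return out;
    }

    public static void main(String[] args) {
        // Same seed -> identical sequences
        System.out.println(Arrays.equals(draw(100, 10), draw(100, 10))); // true
        // Different seeds -> different sequences
        System.out.println(Arrays.equals(draw(100, 10), draw(101, 10))); // false
    }
}
```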

java.util.Random

java.util.Random has been available since Java 1.0. It is a thread-safe class, so you may (in theory) share instances of this class between several threads without getting the same random data in 2 threads at the same time. This thread safety is achieved by storing the generator seed in an AtomicLong.

Random updates its seed via AtomicLong CAS (compare-and-set) operations. Despite being a lightweight non-blocking primitive used in a lot of non-blocking algorithms, CAS behaves really poorly under high contention; the test results below show just how poorly.
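The seed update algorithm (the linear congruential step and its constants) is specified in the Random Javadoc; here is a simplified reimplementation of that CAS retry loop (the class name CasSeedDemo is mine, not from the JDK):

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch of Random's documented seed update: a 48-bit linear
// congruential step applied through an AtomicLong CAS loop.
public class CasSeedDemo {
    private static final long MULTIPLIER = 0x5DEECE66DL;
    private static final long ADDEND = 0xBL;
    private static final long MASK = (1L << 48) - 1;

    private final AtomicLong seed;

    public CasSeedDemo(long seed) {
        // Random scrambles the user-supplied seed on construction
        this.seed = new AtomicLong((seed ^ MULTIPLIER) & MASK);
    }

    public int nextInt() {
        long oldSeed, nextSeed;
        do {
            oldSeed = seed.get();
            nextSeed = (oldSeed * MULTIPLIER + ADDEND) & MASK;
            // Under contention another thread may have won the race,
            // forcing this compareAndSet to fail and the loop to retry.
        } while (!seed.compareAndSet(oldSeed, nextSeed));
        return (int) (nextSeed >>> 16); // next(32): top 32 bits of the 48-bit state
    }

    public static void main(String[] args) {
        // The sketch reproduces java.util.Random's sequence for the same seed
        final Random reference = new Random(100);
        final CasSeedDemo ours = new CasSeedDemo(100);
        for (int i = 0; i < 5; ++i)
            System.out.println(ours.nextInt() == reference.nextInt()); // true
    }
}
```

Every failed compareAndSet here costs an extra trip around the loop, which is exactly the contention cost measured in the tests below.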

java.util.concurrent.ThreadLocalRandom

java.util.concurrent.ThreadLocalRandom was added in Java 7 as an attempt to overcome the performance issues associated with java.util.Random. This new class extends java.util.Random, so you may pass it to any logic expecting the old java.util.Random.

Here are the main ThreadLocalRandom implementation details:

  • Internally it does not use the AtomicLong seed from Random. Instead it uses an ordinary long.
  • You can not create an instance of ThreadLocalRandom yourself, because its constructor is not public. Instead you should use its static factory ThreadLocalRandom.current(), which queries an internal ThreadLocal<ThreadLocalRandom> field.
  • It is CPU-cache aware: it uses 8 dummy long fields for padding, thus pushing anything else out of its 64-byte L1 cache line.
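For illustration, here is a minimal sketch of that padding idea (PaddedRandom is a hypothetical class of mine, not the actual ThreadLocalRandom source; note that the JVM is free to reorder fields, so real implementations pad more defensively, and Java 8 adds @sun.misc.Contended for exactly this purpose):

```java
import java.util.Random;

// Hypothetical sketch of the padding trick: dummy fields inflate the object
// so that its hot seed does not share a 64-byte cache line with another
// thread's data. Behavior is identical to the parent Random.
public class PaddedRandom extends Random {
    // 8 longs = 64 bytes of padding; never read or written
    @SuppressWarnings("unused")
    private long p0, p1, p2, p3, p4, p5, p6, p7;

    public PaddedRandom(long seed) {
        super(seed);
    }
}
```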

All of these changes matter, as the following tests will show.

Tests

We will test the following 5 scenarios:

  1. A single java.util.Random shared between N threads.
  2. A ThreadLocal<Random>, so that each thread gets its own java.util.Random.
  3. java.util.concurrent.ThreadLocalRandom.
  4. java.util.Random[], where each thread N uses a Random at index = N.
  5. java.util.Random[], where each thread N uses a Random at index = N * 2.

All these tests share the same testing method, packaged in the RandomTask class. Each scenario just defines how to obtain the random generator to use.

import java.util.Random;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadLocalRandom;

private static final long COUNT = 10000000;
private static final int THREADS = 2;
 
public static void main(String[] args) {
    System.out.println( "Shared Random" );
    testRandom(THREADS, COUNT);
//  System.out.println("ThreadLocal<Random>");
//  testTL_Random(THREADS, COUNT);
//  System.out.println("ThreadLocalRandom");
//  testTLRandom(THREADS, COUNT);
//  System.out.println("Shared Random[] with no padding");
//  testRandomArray(THREADS, COUNT, 1);
//  System.out.println("Shared Random[] with padding");
//  testRandomArray(THREADS, COUNT, 2);
}
 
//runner for all tests
private static class RandomTask implements Runnable
{
    private final Random rnd;
    protected final int id;
    private final long cnt;
    private final CountDownLatch latch;
 
    private RandomTask(Random rnd, int id, long cnt, CountDownLatch latch) {
        this.rnd = rnd;
        this.id = id;
        this.cnt = cnt;
        this.latch = latch;
    }
 
    protected Random getRandom()
    {
        return rnd;
    }
 
    @Override
    public void run() {
        try {
            final Random r = getRandom();
            latch.countDown();
            latch.await();
            final long start = System.currentTimeMillis();
            int sum = 0;
            for ( long j = 0; j < cnt; ++j )
            {
                sum += r.nextInt();
            }
            final long time = System.currentTimeMillis() - start;
            System.out.println( "Thread #" + id + " Time = " + time / 1000.0 + " sec, sum = " + sum );
        } catch (InterruptedException e) {
        }
    }
}
 
private static void testRandom( final int threads, final long cnt )
{
    final CountDownLatch latch = new CountDownLatch( threads );
    final Random r = new Random( 100 );
    for ( int i = 0; i < threads; ++i )
    {
        final Thread thread = new Thread( new RandomTask( r, i, cnt, latch ) );
        thread.start();
    }
}
 
private static void testRandomArray( final int threads, final long cnt, final int padding )
{
    final CountDownLatch latch = new CountDownLatch( threads );
    final Random[] rnd = new Random[threads * padding];
    for ( int i = 0; i < threads * padding; ++i ) //allocate together
        rnd[ i ] = new Random( 100 );
    for ( int i = 0; i < threads; ++i )
    {
        final Thread thread = new Thread( new RandomTask( rnd[ i * padding ], i, cnt, latch ) );
        thread.start();
    }
}
 
private static void testTLRandom( final int threads, final long cnt )
{
    final CountDownLatch latch = new CountDownLatch( threads );
    for ( int i = 0; i < threads; ++i )
    {
        final Thread thread = new Thread( new RandomTask( null, i, cnt, latch ) {
            @Override
            protected Random getRandom() {
                return ThreadLocalRandom.current();
            }
        } );
        thread.start();
    }
}
 
private static void testTL_Random( final int threads, final long cnt )
{
    final CountDownLatch latch = new CountDownLatch( threads );
    final ThreadLocal<Random> rnd = new ThreadLocal<Random>() {
        @Override
        protected Random initialValue() {
            return new Random( 100 );
        }
    };
    for ( int i = 0; i < threads; ++i )
    {
        final Thread thread = new Thread( new RandomTask( null, i, cnt, latch ) {
            @Override
            protected Random getRandom() {
                return rnd.get();
            }
        } );
        thread.start();
    }
}

Test results

All tests were done on my workstation with a Xeon E5-2650 (8 physical / 16 logical cores @ 2.0-2.8 GHz) and 128 GB RAM, running Linux kernel 3.5.0.

Shared java.util.Random

The first test uses a shared java.util.Random instance. It is badly affected by the very high contention on the CAS operations over the AtomicLong seed. Even 2 threads suffer badly from this contention. Nevertheless, such contention is unlikely to happen in a real-world scenario (except, probably, Monte Carlo methods). Times given as a range are the minimal and maximal run times over all threads.

1 thread - 1.69 sec
2 threads - 13.2, 13.3 sec
4 threads - 34 - 47 sec
8 threads - 89 - 135 sec

"Shared" java.util.concurrent.ThreadLocalRandom

The next test uses the second class, java.util.concurrent.ThreadLocalRandom. As you can see, it does not degrade as long as the number of threads does not exceed the number of logical cores (modern CPUs have a lot of ALUs), and then degrades linearly. Another important thing to notice is that a single thread runs 3 times faster than with java.util.Random: even uncontended CAS operations are not free!

1 thread - 0.57 sec
2 threads - 0.55 sec
4 threads - 0.51 sec
8 threads - 0.50 sec
16 threads - 0.53 - 0.56 sec
32 threads - 0.91 - 1.07 sec
64 threads - 0.89 - 2.07 sec
128 threads - 1.1 - 4.03 sec

"Shared" ThreadLocal<java.util.Random>

After that I decided to check the effect of wrapping java.util.Random instances inside a ThreadLocal. I hoped to get the performance of the single-threaded test, and I did. There was one small difference, though: this time performance started to degrade once I exceeded the number of physical cores, which suggests that CAS operations compete for execution units shared between logical cores. Nevertheless, further degradation was linear and very similar to the ThreadLocalRandom case.

1 thread - 1.69 sec
2 threads - 1.66 sec
4 threads - 1.71 sec
8 threads - 1.76 sec
16 threads - 2.12 - 2.17 sec
32 threads - 3.7 - 4.3 sec
64 threads - 7.2 - 9.3 sec
128 threads - 14.6 - 17.4 sec

Array of java.util.Random

The final thing I wanted to check was the CPU cache line improvement in ThreadLocalRandom: I wanted to emulate the situation where the old java.util.Random suffers from the lack of such a feature. All you need to do is create the java.util.Random instances used by the various threads one after another (instead of creating them inside their threads). I used a java.util.Random[] for this purpose and let thread N use array[N].

Results for 8 threads varied between 4 and 13.9 sec - yes, we have cache issues!

After that I decided to find out how much padding such an array needs in order to avoid the cache line invalidation. I added a padding parameter to the testRandomArray method, which makes each thread use one out of every padding elements in the array. It turned out that padding=2 already saved me from the cache issues: I got a stable 1.765 - 1.77 sec on all 8 threads (the same result as for 8 threads using ThreadLocal<java.util.Random>).
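The same false-sharing effect can be reproduced without Random at all. Here is a minimal hypothetical sketch (the class name and structure are mine, simplified from the benchmark above) using a plain long[]: with stride 1 the two hot counters sit on the same 64-byte cache line, with stride 8 they are a full line apart.

```java
// Two threads each hammer their own slot in a long[]. With stride 1 the
// slots share a cache line (false sharing); with stride 8 they do not.
// Timings are machine dependent, so only correctness is checked here.
public class FalseSharingSketch {
    static long run(final int stride, final long cnt) throws InterruptedException {
        final long[] slots = new long[2 * stride];
        final Thread[] threads = new Thread[2];
        final long start = System.nanoTime();
        for (int i = 0; i < 2; ++i) {
            final int slot = i * stride;
            threads[i] = new Thread(new Runnable() {
                @Override
                public void run() {
                    for (long j = 0; j < cnt; ++j)
                        slots[slot]++; // hot write to this thread's own slot
                }
            });
            threads[i].start();
        }
        for (final Thread t : threads)
            t.join();
        // Each thread owns its slot exclusively, so no updates are lost
        if (slots[0] != cnt || slots[stride] != cnt)
            throw new AssertionError("lost updates");
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        final long cnt = 10000000L;
        System.out.println("stride 1 (shared line): " + run(1, cnt) / 1e9 + " sec");
        System.out.println("stride 8 (padded):      " + run(8, cnt) / 1e9 + " sec");
    }
}
```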

Using Linux perf tool for results analysis

I was curious to find out why I got such results. After consulting the recently reviewed "Systems Performance: Enterprise and the Cloud" book (chapter 6.6.12), I ran a version of every test with 8 active threads under the

perf stat -d

command, which prints verbose statistics (you may print even more counters using the -e option; check the list of available counters with the perf list command).

Unfortunately, these results include JVM startup, so treat them with care for short-running programs.

Random and ThreadLocalRandom

Let's start by comparing the two most different results: the shared java.util.Random and java.util.concurrent.ThreadLocalRandom. The first log belongs to Random, the second to ThreadLocalRandom.

2,553,076,870,919 cycles                    #    2.398 GHz
2,501,593,471,621 stalled-cycles-frontend   #   97.98% frontend cycles idle
2,454,797,083,551 stalled-cycles-backend    #   96.15% backend  cycles idle
   42,807,128,658 instructions              #    0.02  insns per cycle
                                            #   58.44  stalled cycles per insn
    4,999,510,690 branches                  #    4.697 M/sec
      862,334,730 branch-misses             #   17.25% of all branches
   12,231,515,289 L1-dcache-loads           #   11.490 M/sec
    5,471,297,449 L1-dcache-load-misses     #   44.73% of all L1-dcache hits

 9,321,932,029 cycles                    #    2.206 GHz
 6,767,646,797 stalled-cycles-frontend   #   72.60% frontend cycles idle
 2,049,190,051 stalled-cycles-backend    #   21.98% backend  cycles idle
10,094,934,215 instructions              #    1.08  insns per cycle
                                         #    0.67  stalled cycles per insn
   816,688,169 branches                  #  193.249 M/sec
     1,506,379 branch-misses             #    0.18% of all branches
 1,683,890,500 L1-dcache-loads           #  398.451 M/sec
     4,508,729 L1-dcache-load-misses     #    0.27% of all L1-dcache hits

As you can see, the Random results are terrible: it needed ~274 times more CPU cycles to finish the same task! Nearly every one of these cycles was stalled in the CPU pipeline. In reality, only 42.8 billion (10^9) instructions were executed over these 2.55 trillion (10^12) cycles, exposing the terrible throughput of 0.02 instructions per cycle (good non IO-bound software should execute at least 1 instruction per cycle). The next metric is the number of branches: nearly 5 billion were executed, but 17.25% of them were mispredicted (a very bad prediction rate, with each miss causing a CPU pipeline flush). Finally, the code tried to load data from the L1 cache 12.2 billion times, but missed in 44.73% of cases. Let's delay the interpretation of these values a little.

On the other hand, ThreadLocalRandom needed only 9.3 billion cycles to finish (~274 times fewer than in the previous example). The share of stalled cycles is considerably smaller: only 22% of backend cycles (the backend does the data fetching) were idle, and I suspect most of these stalls actually happened at JVM startup during class loading. This time we needed only 10 billion instructions (4.2 times fewer than in the previous example; this difference hints at what you can expect in the single-threaded scenario - the actual difference turned out to be ~3.4 times). 6 times fewer branches were executed (816 million), and nearly all of them were predicted correctly (which is what you should generally expect). We attempted to load data from the L1 cache 1.7 billion times and succeeded nearly every time (0.27% misses).

Now let's interpret these values. First of all, we should be looking for multiples of 800M, because we have 8 threads, each executing its loop 100M times.

Branches are the most obvious. We have 8 threads, each executing the for loop 100M times, which means the test code alone has to execute 800M branches. ThreadLocalRandom executed 816M branches, leaving 16M for various JVM startup checks. We can conclude that no branches were needed inside the ThreadLocalRandom code itself, because otherwise we would see at least 1.6G branches in the output (one extra branch per iteration). Branchless code is generally fast!

Random, on the other hand, needed ~5G branches. As we found above, the JVM is responsible for only a tiny share of these, so we may say that each seed update needed ~6.25 (5G/800M) branches before the seed was actually set. The 862M branch misses support this conclusion: the CPU expected each CAS loop to keep spinning (i.e. the compareAndSet to fail), so the successes were treated as mispredictions.

L1 cache loads. We have 12.2G attempts to load data from the L1 cache for Random and 1.7G for ThreadLocalRandom. As you can see in the code below (Random.next(int)), each iteration requires at least 2 memory accesses to the random seed (one for get and one for compareAndSet), but under contention there may be more of them:

protected int next(int bits) {
    long oldseed, nextseed;
    AtomicLong seed = this.seed;
    do {
        oldseed = seed.get();
        nextseed = (oldseed * multiplier + addend) & mask;
    } while (!seed.compareAndSet(oldseed, nextseed));
    return (int)(nextseed >>> (48 - bits));
}

ThreadLocalRandom.next(int) requires 2 memory accesses as well (the read accesses to rnd, the ThreadLocalRandom seed, in lines 2 and 3):

protected int next(int bits) {
    rnd = (rnd * multiplier + addend) & mask;
    return (int) (rnd >>> (48-bits));
}

So, in the case of ThreadLocalRandom we always have 2 memory accesses to the seed per iteration, which results in 1.6G L1 cache loads.

Unfortunately, it is not obvious how many memory accesses each iteration of Random.next(int) needs without looking at the actual generated assembly code.

ThreadLocal<Random>

33,255,538,502 cycles                    #    2.338 GHz
26,334,159,876 stalled-cycles-frontend   #   79.19% frontend cycles idle
13,886,446,385 stalled-cycles-backend    #   41.76% backend  cycles idle
19,278,411,972 instructions              #    0.58  insns per cycle
                                         #    1.37  stalled cycles per insn
 2,431,012,359 branches                  #  170.882 M/sec
     1,462,720 branch-misses             #    0.06% of all branches
 5,700,951,571 L1-dcache-loads           #  400.734 M/sec
     6,710,655 L1-dcache-load-misses     #    0.12% of all L1-dcache hits

In this case we have uncontended Random execution, so we should expect the minimal possible values for our counters.

We have 2.4G branches, which means that one iteration requires 3 branches. The first one is the test's for loop. The second and the third can be spotted in Random.next(int):

do {
    oldseed = seed.get();
    nextseed = (oldseed * multiplier + addend) & mask;
} while (!seed.compareAndSet(oldseed, nextseed));
    

The while condition is responsible for the second branch and AtomicLong.compareAndSet for the third.

Random[] with no padding

The final case to look at is CPU cache line contention between 2 different instances of Random (padding=1). As you can see, the numbers of instructions, branches and L1 cache loads are very similar to the previous, uncontended example. But unlike that example, the number of cache load misses has skyrocketed to 1.7G, which suggests 2 cache misses per iteration. If you look at the Random.next(int) snippet above, you will see that the seed is accessed twice per iteration: when we get it at line 2 and when we set it (in the compareAndSet) at line 4. These permanent cache line invalidations meant 2 RAM accesses on every iteration, which made the code 6 times slower.

209,445,037,827 cycles                    #    2.416 GHz
189,365,989,000 stalled-cycles-frontend   #   90.41% frontend cycles idle
169,936,986,823 stalled-cycles-backend    #   81.14% backend  cycles idle
 19,659,497,312 instructions              #    0.09  insns per cycle
                                          #    9.63  stalled cycles per insn
  2,475,303,839 branches                  #   28.552 M/sec
      2,315,797 branch-misses             #    0.09% of all branches
  5,719,174,890 L1-dcache-loads           #   65.968 M/sec
  1,703,679,647 L1-dcache-load-misses     #   29.79% of all L1-dcache hits

Summary

  • Do not share an instance of java.util.Random between several threads under any circumstances; wrap it in a ThreadLocal instead.
  • From Java 7 on, prefer java.util.concurrent.ThreadLocalRandom to java.util.Random in all circumstances: it is backwards compatible with existing code, but uses cheaper operations internally.
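A minimal usage sketch of the recommended pattern (the class name TlrUsage is mine):

```java
import java.util.concurrent.ThreadLocalRandom;

public class TlrUsage {
    public static void main(String[] args) {
        // Obtain the generator via current() at the point of use: it returns
        // the instance bound to the calling thread. Do not stash it in a
        // field shared across threads.
        final int dice = ThreadLocalRandom.current().nextInt(1, 7);
        // nextInt(least, bound) is a ThreadLocalRandom-specific convenience:
        // it returns a value in [1, 7), i.e. 1..6
        System.out.println("dice = " + dice);
    }
}
```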
