Introduction to JMH Profilers

by Mikhail Vorontsov

This article follows the Introduction to JMH article, which you should read first.

This article will give you an overview of profilers available inside the JMH framework.

List of available profilers

Name Description
CL Classloader profiling via standard MBeans
COMP JIT compiler profiling via standard MBeans
GC GC profiling via standard MBeans
HS_CL HotSpot ™ classloader profiling via implementation-specific MBeans
HS_COMP HotSpot ™ JIT compiler profiling via implementation-specific MBeans
HS_GC HotSpot ™ memory manager (GC) profiling via implementation-specific MBeans
HS_RT HotSpot ™ runtime profiling via implementation-specific MBeans
HS_THR HotSpot ™ threading subsystem profiling via implementation-specific MBeans
STACK Simple and naive Java stack profiler

You can specify which profiler to use via the JMH API:

Options opt = new OptionsBuilder()
        .include(".*" + YourClass.class.getSimpleName() + ".*")
        .forks(1)
        .addProfiler( ProfilerType.STACK )
        .build();


Simple STACK profiler

The STACK profiler is useful for a rough evaluation of which methods consume most of your CPU time. It runs in a separate thread, wakes up periodically and collects the stack traces of all application threads. It keeps the top N stack lines until the end of the test, so it needs some extra memory.

This profiler is not much different from the sampling functionality of commercial profilers or jstack.
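
To see what such a sampler boils down to, here is a minimal sketch in plain Java (this is not JMH's actual implementation; the class name `NaiveSampler` is mine, for illustration). It periodically calls `Thread.getAllStackTraces()` and counts the top-most frame of every thread:

```java
import java.util.HashMap;
import java.util.Map;

// A naive sampling profiler sketch (NOT JMH's actual code): wake up
// periodically, grab all stack traces, and count the top frame of each thread.
public class NaiveSampler implements Runnable {
    private final Map<String, Integer> counts = new HashMap<>();
    private final long periodMs;
    private volatile boolean stopped = false;

    public NaiveSampler(long periodMs) { this.periodMs = periodMs; }

    @Override
    public void run() {
        while (!stopped) {
            for (StackTraceElement[] stack : Thread.getAllStackTraces().values()) {
                if (stack.length == 0) continue; // thread had no Java frames
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                counts.merge(top, 1, Integer::sum);
            }
            try { Thread.sleep(periodMs); } catch (InterruptedException ie) { return; }
        }
    }

    public void stop() { stopped = true; }

    // Read only after the sampler thread has been joined
    public Map<String, Integer> counts() { return counts; }
}
```

The real STACK profiler keeps whole top-of-stack sections rather than single frames, but the sampling loop is conceptually the same.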

All JMH profilers can be configured via Java system properties (java -Dname=value). You can also specify these parameters in the @Fork annotation:

@Fork(value = 1, jvmArgsAppend = "-Djmh.stack.lines=3")

The STACK profiler has the following parameters:

Name Description Default
jmh.stack.lines Number of top stack lines to save. JMH sorts whole stack sections (not individual lines) to find the biggest CPU consumers. So, if the two most active stack traces are “A, B, C” and “A, B, D” (and jmh.stack.lines=3), they will appear separately in the output, for example: “20%: A, B, C; 15%: A, B, D”. At the same time, with jmh.stack.lines=1 you will likely get: “35%: A”. 1 (you may want to increase it if there are multiple callers for the top methods)
jmh.stack.top Number of top stack traces to show in the profiler output. I usually need just a few top ones. 10
jmh.stack.period How often a profiler should take samples of your stack traces (in milliseconds). See discussion below. 10 (ms)
jmh.stack.detailLine Whether to save stack line numbers, which allows you to distinguish between different lines of the same method. This can be quite useful for long methods (especially ones with several loops). This feature works only with jmh.stack.lines=1. false
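
For example, several of these parameters can be combined in one @Fork annotation (the values below are arbitrary illustrations, not recommendations):

```java
// Illustrative only: tune STACK profiler settings for one benchmark class.
@Fork(value = 1, jvmArgsAppend =
        "-Djmh.stack.lines=3 "    // keep 3 top stack lines per sample
      + "-Djmh.stack.top=5 "      // print only the 5 hottest stacks
      + "-Djmh.stack.period=5")   // sample every 5 ms
```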

How often can we poll the stack traces? You may remember that populating a stack trace is a costly operation which should be avoided if possible. We may even suspect that getting a stack trace freezes the execution of the other threads… Let’s use JMH and see. We will test two ways of getting stack traces (JMH uses management beans) and their impact on other threads. The test contains two groups, each consisting of two threads: a stack dumping thread and a thread calculating sin(PI). There is also a third, baseline test, which calculates sin(PI) in both threads.

Using this test, we can compare the performance of both stack dumping methods and their impact on other threads (the faster the other threads run, the better):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 3, time = 1, timeUnit = TimeUnit.SECONDS )
@Measurement(iterations = 7, time = 3, timeUnit = TimeUnit.SECONDS)
@Threads( 2 )
@State(Scope.Thread)
public class ThreadDumpTest {
    private static final long SLEEP = 10;
 
    boolean lockedMonitors = false;
    boolean lockedSynchronizers = false;
 
    double val = Math.PI;
 
    @GenerateMicroBenchmark
    public double testSine()
    {
        return Math.sin( val );
    }
 
    @Group("jmh")
    @GroupThreads(1)
    @GenerateMicroBenchmark
    public ThreadInfo[] testJmhWay() throws InterruptedException {
        Thread.sleep( SLEEP );
        return ManagementFactory.getThreadMXBean().dumpAllThreads(lockedMonitors, lockedSynchronizers);
    }
 
    @Group("jmh")
    @GroupThreads(1)
    @GenerateMicroBenchmark
    public double sine1()
    {
        return Math.sin( val );
    }
 
 
    @Group("jdk")
    @GroupThreads(1)
    @GenerateMicroBenchmark
    public Map<Thread, StackTraceElement[]> testCoreWay() throws InterruptedException {
        Thread.sleep( SLEEP );
        return Thread.getAllStackTraces();
    }
 
    @Group("jdk")
    @GroupThreads(1)
    @GenerateMicroBenchmark
    public double sine2()
    {
        return Math.sin( val );
    }
 
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(".*" + ThreadDumpTest.class.getSimpleName() + ".*")
                .forks(1)
                .build();
 
        new Runner(opt).run();
    }
}

Here are the test results:

Benchmark                                  Mode   Samples         Mean   Mean error    Units
i.j.p.e.ThreadDumpTest.jdk                 avgt         7        0.687        0.024    us/op
i.j.p.e.ThreadDumpTest.jdk:sine2           avgt         7        0.344        0.012    us/op
i.j.p.e.ThreadDumpTest.jdk:testCoreWay     avgt         7      524.842       42.940    us/op
i.j.p.e.ThreadDumpTest.jmh                 avgt         7        0.640        0.024    us/op
i.j.p.e.ThreadDumpTest.jmh:sine1           avgt         7        0.320        0.012    us/op
i.j.p.e.ThreadDumpTest.jmh:testJmhWay      avgt         7      544.717       61.805    us/op
i.j.p.e.ThreadDumpTest.testSine            avgt         7        0.164        0.002    us/op

As you can see, both thread dumping methods affect the other threads if called too often (the impact is a 2x slowdown in this test – compare 0.164 us/op with 0.320 / 0.344 us/op).

Try adding Thread.sleep(1) to the testCoreWay/testJmhWay methods prior to getting the stack dumps. You will see that the impact of the dumping thread goes down to ~20%, which is still a huge overhead for a measurement tool (all these numbers are CPU-dependent!):

Benchmark                                  Mode   Samples         Mean   Mean error    Units
i.j.p.e.ThreadDumpTest.jdk                 avgt         7        0.446        0.011    us/op
i.j.p.e.ThreadDumpTest.jdk:sine2           avgt         7        0.223        0.005    us/op
i.j.p.e.ThreadDumpTest.jdk:testCoreWay     avgt         7     1037.739       18.784    us/op
i.j.p.e.ThreadDumpTest.jmh                 avgt         7        0.450        0.013    us/op
i.j.p.e.ThreadDumpTest.jmh:sine1           avgt         7        0.225        0.006    us/op
i.j.p.e.ThreadDumpTest.jmh:testJmhWay      avgt         7     1026.066       17.075    us/op
i.j.p.e.ThreadDumpTest.testSine            avgt         7        0.181        0.015    us/op

Now let’s increase the sleep time to 10 ms (the default value for the STACK profiler). As you can see in the test results, the effect of thread dumping becomes negligible:

Benchmark                                  Mode   Samples         Mean   Mean error    Units
i.j.p.e.ThreadDumpTest.jdk                 avgt         7        0.332        0.013    us/op
i.j.p.e.ThreadDumpTest.jdk:sine2           avgt         7        0.165        0.002    us/op
i.j.p.e.ThreadDumpTest.jdk:testCoreWay     avgt         7    10319.648      766.501    us/op
i.j.p.e.ThreadDumpTest.jmh                 avgt         7        0.333        0.010    us/op
i.j.p.e.ThreadDumpTest.jmh:sine1           avgt         7        0.166        0.005    us/op
i.j.p.e.ThreadDumpTest.jmh:testJmhWay      avgt         7    10342.097      567.886    us/op
i.j.p.e.ThreadDumpTest.testSine            avgt         7        0.172        0.012    us/op

You can find out the highest possible frequency of thread dumps on your computer using this test.

Simple classloader profiler: CL

This profiler is supposed to report the number of classes loaded/unloaded during the current test run. I am not sure about its usefulness: it reports class loading/unloading stats during the measurement iterations, but the same code has already run during warmup. As a result, you are unlikely to see any loaded classes after warmup.
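
Under the hood the CL profiler diffs the standard ClassLoadingMXBean counters before and after an iteration. You can query the same counters yourself (the class name `ClassLoadingCounters` is mine, for illustration):

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

// Read the same standard MBean counters the CL profiler diffs
// before and after an iteration.
public class ClassLoadingCounters {
    public static void main(String[] args) throws Exception {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        long before = cl.getLoadedClassCount();

        // Force loading of a class that is unlikely to be loaded yet
        Class.forName("java.util.zip.Adler32");

        long after = cl.getLoadedClassCount();
        System.out.println("Loaded classes: " + before + " -> " + after
                + " (total ever loaded: " + cl.getTotalLoadedClassCount()
                + ", unloaded: " + cl.getUnloadedClassCount() + ")");
    }
}
```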

In order to get non-zero results, you need to set the number of warmup iterations to zero: @Warmup(iterations = 0). You will then see an output like this on the very first iteration:

Iteration   1: 589629.578 ns/op
       Class | wall time = 1.007 secs, loaded = +5, unloaded = +0

Note that if both loaded and unloaded counts equal zero, this profiler prints anything only if you have enabled the verbose output mode:

Options opt = new OptionsBuilder()
        .include(".*" + ConversionTests.class.getSimpleName() + ".*")
        .verbosity(VerboseMode.EXTRA) //VERBOSE OUTPUT
        .addProfiler(ProfilerType.CL)
        .build();

Simple JIT compiler profiler: COMP

This profiler tells you the time spent by the JIT compiler during the current iteration. It also requires verbose output mode if that time is 0. Unlike the CL profiler, this one is useful while you tune your tests: it helps you find a sufficient number of warmup iterations, so that your actual results are not affected by JIT compilation. The algorithm is simple: turn on the COMP profiler and keep increasing the number of warmup iterations while the profiler still reports compilation activity.
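
The COMP profiler gets its numbers from the standard CompilationMXBean; a quick way to inspect the same counter yourself (the class name `JitTime` is mine, for illustration):

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// The standard MBean behind the COMP profiler: accumulated JIT compilation
// time, if the JVM supports monitoring it.
public class JitTime {
    public static void main(String[] args) {
        CompilationMXBean comp = ManagementFactory.getCompilationMXBean();
        if (comp == null) {
            System.out.println("No JIT compiler available");
            return;
        }
        System.out.println("JIT compiler: " + comp.getName());
        if (comp.isCompilationTimeMonitoringSupported()) {
            System.out.println("Total JIT time so far: "
                    + comp.getTotalCompilationTime() + " ms");
        }
    }
}
```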

Here is the possible COMP profiler output with verbose mode turned on:

Iteration   1: 412075.665 ns/op
         JIT | wall time = 1.060 secs, JIT time = 1.703 secs (HotSpot 64-Bit Tiered Compilers)

Iteration   2: 134454.109 ns/op
         JIT | wall time = 1.000 secs, JIT time = 0.930 secs (HotSpot 64-Bit Tiered Compilers)

Iteration   3: 121027.712 ns/op
         JIT | wall time = 1.000 secs, JIT time = 0.000 secs (HotSpot 64-Bit Tiered Compilers)

Simple garbage collection profiler: GC

The simple garbage collection profiler allows you to find out the time spent in garbage collection during a test and the number of garbage collections (major and minor collections cannot be distinguished using this profiler).

The usefulness of this profiler is rather questionable: you can estimate the share of time spent in GC, but it heavily depends on your JVM settings. For example, GC time will likely go down if you set -Xmx and -Xms to the same value (1G, for example).
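
The GC profiler diffs the standard GarbageCollectorMXBean counters around an iteration; the same data is available directly (the class name `GcCounters` is mine, for illustration):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// The standard MBeans behind the GC profiler: per-collector invocation
// counts and accumulated collection time.
public class GcCounters {
    public static void main(String[] args) {
        long totalCount = 0, totalTime = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount()/getCollectionTime() may return -1 if undefined
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
            totalCount += Math.max(0, gc.getCollectionCount());
            totalTime  += Math.max(0, gc.getCollectionTime());
        }
        System.out.println("Total: " + totalCount + " collections, " + totalTime + " ms");
    }
}
```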

Here is an example of GC profiler output:

Iteration   1: 486415.046 ns/op
          GC | wall time = 1.009 secs,  GC time = 0.000 secs, GC% = 0.00%, GC count = +0

Iteration   2: 145914.896 ns/op
          GC | wall time = 1.000 secs,  GC time = 0.007 secs, GC% = 0.70%, GC count = +1

Iteration   3: 110347.832 ns/op
          GC | wall time = 1.000 secs,  GC time = 0.010 secs, GC% = 1.00%, GC count = +2

Hotspot profilers: HS_*

All HotSpot profilers rely on reading a set of counters via JMX and calculating the difference before and after an iteration. For that reason I would not recommend using verbose output mode with these profilers – you will get too many zero values (no difference between iterations).

Hotspot compilation profiler: HS_COMP

This profiler gives you more detailed JIT statistics: the number of compiled blocks of code, on-stack replacements (OSR), invalidated code (for example, when a second implementation of a given interface is loaded) and bailouts. It also reports the amount of generated native code and the time spent on compilation (as well as the number of compiler threads).

Here is the example output:

Iteration   1: 449304.868 ns/op
     HS(JIT) | wall time = 1.002 secs, JIT time = 2.782 secs, 3 threads
             | +83 compiles, +0 bailouts, +0 invalidates, +134464 native bytes generated, +241840 instructions
             | +1 OSR compiles, +108 bytecode bytes compiled, +16459 us spent
             | +82 normal compiles, +21618 bytecode bytes compiled, +2768017 us spent

Iteration   2: 136779.268 ns/op
     HS(JIT) | wall time = 1.000 secs, JIT time = 1.767 secs, 3 threads
             | +18 compiles, +0 bailouts, +0 invalidates, +35680 native bytes generated, +71240 instructions
             | +0 OSR compiles, +0 bytecode bytes compiled, +0 us spent
             | +18 normal compiles, +6989 bytecode bytes compiled, +1774178 us spent

Hotspot threading profiler: HS_THR

This profiler gives you the number of started/active threads in your application. While the number of threads alive after a test iteration is not that useful (most likely, it will be the main thread plus JRE housekeeping threads), you may be interested in the delta – how many threads were started/stopped during the iteration.
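
The deltas in the HS_THR output correspond to counters that are also exposed (in portable form) by the standard ThreadMXBean, which you can query yourself (the class name `ThreadCounters` is mine, for illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Standard thread counters: total threads ever started vs. currently alive.
// HS_THR reports deltas of similar (implementation-specific) counters.
public class ThreadCounters {
    public static void main(String[] args) throws Exception {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long startedBefore = threads.getTotalStartedThreadCount();

        Thread t = new Thread(new Runnable() {
            @Override
            public void run() { /* no-op */ }
        });
        t.start();
        t.join();

        long startedAfter = threads.getTotalStartedThreadCount();
        System.out.println("Threads started during the snippet: "
                + (startedAfter - startedBefore)
                + ", currently alive: " + threads.getThreadCount());
    }
}
```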

Let’s write a simple test which tells us how long it takes to spawn a thread:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.profile.ProfilerType;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 4, time = 3, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 3, timeUnit = TimeUnit.SECONDS)
@Threads(1)
@State(Scope.Thread)
public class ThreadSpawnTest {
 
    @GenerateMicroBenchmark
    public Thread testThreadSpawn()
    {
        final Runnable task = new Runnable() {
            @Override
            public void run() {
            }
        };
 
        final Thread thr = new Thread( task );
        thr.start();
        return thr;
    }
 
 
    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(".*" + ThreadSpawnTest.class.getSimpleName() + ".*")
                .addProfiler(ProfilerType.HS_THR )
                .forks(1)
                .build();
 
        new Runner(opt).run();
    }
}

Here is the test output:

Iteration   1: 0.246 ms/op
     HS(THR) | 20080 (+4068) threads started, 6 (-1) threads alive, 5 (-1) threads daemonized

Iteration   2: 0.245 ms/op
     HS(THR) | 24165 (+4082) threads started, 6 (-3) threads alive, 5 (-2) threads daemonized

Iteration   3: 0.246 ms/op
     HS(THR) | 28231 (+4062) threads started, 6 (-4) threads alive, 5 (-3) threads daemonized

Iteration   4: 0.243 ms/op
     HS(THR) | 32371 (+4136) threads started, 6 (-4) threads alive, 5 (-4) threads daemonized

Iteration   5: 0.242 ms/op
     HS(THR) | 36506 (+4135) threads started, 6 (+0) threads alive, 5 (+0) threads daemonized

Result : 0.244 ±(99.9%) 0.007 ms/op
  Statistics: (min, avg, max) = (0.242, 0.244, 0.246), stdev = 0.002
  Confidence interval (99.9%): [0.237, 0.252]

Benchmark                                   Mode   Samples         Mean   Mean error    Units
i.j.p.e.ThreadSpawnTest.testThreadSpawn     avgt         5        0.244        0.007    ms/op

We are interested in the profiler results here. As you can see, we create ~4,100 threads during each iteration, yet the number of alive threads remains constant (6). This means that all these threads were stopped and destroyed by the OS.

Hotspot runtime profiler: HS_RT

This profiler gives you information about Java locks (a.k.a. monitors) and safepoints (places in your code where JVM intervention is allowed – for example, where a garbage collection or a thread dump can happen).

Example HS_RT output:
Iteration   1: 396393.233 ns/op
      HS(RT) | 128 fat monitors remaining, +0 monitors inflated, +5 monitors deflated
             | +0 contended lock attempts, +0 parks, +0 notify()'s, +0 futile wakeup(s)
             | +1 safepoints hit(s), +132 ms spent on sync safepoints, +271 ms spent on safepoints

Hotspot classloading and garbage collection profilers: HS_CL and HS_GC

These two profilers report the difference between the classloading / garbage collection JMX counters. Both could greatly help JVM core developers, but I am not sure about the rest of us, ordinary mortals :) Besides, even Google does not know about some of these counters.

Example HS_CL output:

HS(Class) | difference: {java.cls.loadedClasses=1, sun.cls.appClassLoadCount=1, sun.cls.appClassLoadTime=130,
sun.cls.appClassLoadTime.self=130, sun.cls.jvmFindLoadedClassNoLockCalls=2, sun.cls.loadedBytes=1312,
sun.cls.lookupSysClassTime=186, sun.cls.methodBytes=528, sun.cls.parseClassTime=287, sun.cls.parseClassTime.self=286,
sun.cls.sharedClassLoadTime=1, sun.cls.sysClassBytes=1763, sun.cls.sysClassLoadTime=533, sun.cls.time=602}

Example HS_GC output:

HS(GC) | difference: {sun.gc.collector.0.invocations=2, sun.gc.collector.0.lastEntryTime=5589811, sun.gc.collector.0.lastExitTime=5599082,
sun.gc.collector.0.time=21043, sun.gc.compressedclassspace.capacity=262144, sun.gc.compressedclassspace.used=546792,
sun.gc.generation.0.space.0.used=-137562032, sun.gc.generation.0.space.1.used=1310752, sun.gc.generation.1.space.0.used=16384,
sun.gc.metaspace.capacity=1048576, sun.gc.metaspace.used=4734192, sun.gc.policy.avgBaseFootprint=268435456,
sun.gc.policy.avgMinorIntervalTime=1682, sun.gc.policy.avgMinorPauseTime=6, sun.gc.policy.avgPromotedAvg=-268427264,
sun.gc.policy.avgPromotedPaddedAvg=-268427264, sun.gc.policy.avgSurvivedAvg=-43057136, sun.gc.policy.avgSurvivedDev=98296,
sun.gc.policy.avgSurvivedPaddedAvg=-42762248, sun.gc.policy.avgYoungLive=1507344, sun.gc.policy.desiredSurvivorSize=44564480,
sun.gc.policy.liveSpace=269942784, sun.gc.policy.minorPauseTime=5, sun.gc.policy.mutatorCost=98,
sun.gc.policy.oldCapacity=-357564416, sun.gc.policy.promoted=8192, sun.gc.policy.survived=1310752, sun.gc.policy.tenuringThreshold=-8,
sun.gc.tlab.alloc=34171776, sun.gc.tlab.allocThreads=2, sun.gc.tlab.fills=53, sun.gc.tlab.gcWaste=671086, sun.gc.tlab.maxFills=52,
sun.gc.tlab.maxGcWaste=671086, sun.gc.tlab.maxSlowWaste=220, sun.gc.tlab.slowWaste=220}

Summary

  • JMH profilers are a cheap and convenient way to find bottlenecks in your microbenchmarks: they avoid measuring JMH's own code as much as possible, and you can use any subset of them simultaneously.
  • The STACK profiler makes a thread dump every 10 ms by default, but you may want to decrease this interval on powerful machines.
  • The JIT compiler profilers (COMP / HS_COMP) are recommended for most benchmarks – they will let you know if your code was insufficiently warmed up.
