Java Performance Tuning Guide
http://java-performance.info

Java server application troubleshooting using JDK tools
by Mikhail Vorontsov

1. Introduction
2. Troubleshooting scenarios
    2.1. Getting a list of running JVMs
    2.2. Making a heap dump
    2.3. Analyzing a class histogram
    2.4. Making a thread dump
    2.5. Running Java Flight Recorder

1. Introduction

In the Java world most of us are used to GUI tools for all stages of Java application development: writing your code, debugging and profiling it. We often prefer to set up the server environment on our dev boxes and try to reproduce the problems locally using familiar tools. Unfortunately, it is often impossible to reproduce some issues locally for various reasons. For example, you may not be authorised to access the real client data which is processed by your server application.

In situations like this one you need to troubleshoot the application remotely on the server box. Keep in mind that you can not properly troubleshoot an application with a bare JRE in your hands: it contains all the troubleshooting functionality, but there is literally no way to access it. As a result, you need either a JDK or some 3rd party tools on the same box. This article will describe the JDK tools, because you are more likely to be allowed to use them on production boxes than any 3rd party tools, which require a security audit in many organizations.

In the general case it is sufficient to just unpack the JDK distribution onto your box – you don’t need to install it properly for troubleshooting purposes (in fact, a proper installation may be undesirable in a lot of cases). For the JMX based functionality you can install literally any Java 7/8 JDK, but some tools can not recognize a JDK from the future, so I advise you to install either the latest Java 7/8 JDK or the build exactly matching the server JRE – the matching build also allows you to dump the heap of an application which is not reaching any safepoints at the moment (an application in idle mode is the easy example of such a “no safepoints” application).

2. Troubleshooting scenarios

2.1. Getting a list of running JVMs

In order to start working you nearly always need to get a list of running JVMs, their process IDs and command line arguments. Sometimes it may be enough: you may find a second instance of the same application doing the same job concurrently (and damaging the output files / reopening sockets / doing some other stupid things).

Just run jcmd without any arguments. It will print you a list of running JVMs:

3824 org.jetbrains.idea.maven.server.RemoteMavenServer
2196
780 sun.tools.jcmd.JCmd

Now you can see what diagnostic commands are available for a given JVM by running the jcmd <PID> help command. Here is a sample output for VisualVM:

>jcmd 3036 help

3036:
The following commands are available:
JFR.stop
JFR.start
JFR.dump
JFR.check
VM.native_memory
VM.check_commercial_features
VM.unlock_commercial_features
ManagementAgent.stop
ManagementAgent.start_local
ManagementAgent.start
GC.rotate_log
Thread.print
GC.class_stats
GC.class_histogram
GC.heap_dump
GC.run_finalization
GC.run
VM.uptime
VM.flags
VM.system_properties
VM.command_line
VM.version
help

Type jcmd <PID> <COMMAND_NAME> to either run a diagnostic command or get an error message asking for command arguments:

>jcmd 3036 GC.heap_dump

3036:
java.lang.IllegalArgumentException: The argument 'filename' is mandatory.

You can get more information about a diagnostic command’s arguments by using the following command: jcmd <PID> help <COMMAND_NAME>. For example, here is the output for the GC.heap_dump command:

>jcmd 3036 help GC.heap_dump
        
3036:
GC.heap_dump
Generate a HPROF format dump of the Java heap.

Impact: High: Depends on Java heap size and content. Request a full GC unless the '-all' option is specified.

Permission: java.lang.management.ManagementPermission(monitor)

Syntax : GC.heap_dump [options] <filename>

Arguments:
filename :  Name of the dump file (STRING, no default value)

Options: (options must be specified using the <key> or <key>=<value> syntax)
-all : [optional] Dump all objects, including unreachable objects (BOOLEAN, false)


2.2. Making a heap dump

jcmd provides you with a handy interface for making a heap dump in the HPROF format. Simply run jcmd <PID> GC.heap_dump <FILENAME>. Note that the file name is relative to the running JVM’s current directory rather than your own current directory, so you may want to specify the full path. It is a good idea to use the .hprof extension for the dump file name.

After the heap dump completes you can copy the file to your own box and either open it in VisualVM (it is a part of the JDK) and use its heap walker and query language functionality, or load it into the JOverflow plugin of Java Mission Control and analyze it for various memory issues.

Note 1: of course there are lots of other tools capable of handling hprof files: NetBeans, Eclipse Memory Analyzer, YourKit, etc. Use your favorite one once you have downloaded an .hprof file to your box.

Note 2: you can make a heap dump using the jmap tool as well: jmap -dump:live,file=<FILE_NAME> <PID>. The problem with it is that jmap is officially documented as unsupported. Many of us assumed that unsupported parts of the JDK would remain there forever, but it turns out this is no longer the case: see JEP 240 and JEP 241.

2.3. Analyzing a class histogram

If you are looking for a memory leak, you are usually interested just in the number of live objects of a certain type in the heap. For example, you may know that you can have only one object of a certain type at a time (some sort of main working class in your application). There may also be one or more instances of the same class in the old generation which have not been garbage collected so far, but they should not be accessible from the application roots.

To print the class histogram run either of these commands (both commands print the number of live objects):

jcmd <PID> GC.class_histogram
jmap -histo:live <PID>

Here are the top few lines of the example output:

 num     #instances         #bytes  class name
----------------------------------------------
   1:          5923        5976952  [I
   2:         50034        4127704  [C
   3:         49465        1187160  java.lang.String
   4:           188        1069496  [J
   5:          3985        1067240  [Ljava.util.HashMap$Node;
   6:          8756         982872  java.lang.Class
   7:          2855         835792  [B
   8:         23570         754240  java.util.HashMap$Node
   9:         13964         671440  [Ljava.lang.Object;
  10:          9642         308544  java.util.Hashtable$Entry
  11:          4453         213744  java.util.HashMap

Note that the occupied size in bytes is a shallow size – it does not include any child objects. It is easy to see this from the char[] (class name = [C) and String stats – while the numbers of instances are similar (though there are always more char[]-s than Strings), the size of the char[]-s is noticeably bigger, which could not be the case if a String’s size included the underlying char[].

Now you can just grep/search for the class name you are interested in and check the number of live instances. If you see more instances than expected, make a heap dump and analyze it in any heap walker (see above).

2.4. Making a thread dump

Sometimes your application may be reported as “not doing anything / got stuck”. There are many kinds of being “stuck” – a deadlock, high resource contention or simply an O(N^10) algorithm processing the requests of a few million users :) In all these situations you should know what your application threads are executing and what locks they hold.

There are 2 kinds of locks: the original ones, based on the synchronized keyword and the Object.wait/notifyAll methods, and the java.util.concurrent locks introduced in Java 5. The major difference between them is that the former are bound to the stack frame where you enter the synchronized section and were always available in the thread dumps. The latter (java.util.concurrent), on the other hand, are not stack frame bound – you can acquire the lock in one method and release it in a different one. As a result, for some time they were not printed in the thread dump at all, and even now printing them is still optional. Nevertheless, you need both sorts of locks in your thread dump to properly investigate threading issues.
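A minimal sketch of the difference (the class and method names are made up for illustration): a synchronized section can not span methods, while a ReentrantLock can be acquired in one method and released in another:

import java.util.concurrent.locks.ReentrantLock;

public class LockScopes {
    private final Object monitor = new Object();
    private final ReentrantLock lock = new ReentrantLock();

    public void monitorStyle() {
        synchronized ( monitor ) {
            // the monitor is entered and exited within this stack frame
        }
    }

    public void beginUpdate() {
        lock.lock();   // acquired here...
    }

    public void endUpdate() {
        lock.unlock(); // ...released in a different method (by the same thread)
    }
}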

There are 3 ways to print the application thread dump. You can run kill -3 <PID> on Linux, or you can run one of the following commands on any platform:

jstack <PID>
jcmd <PID> Thread.print
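You can also capture a dump containing both sorts of locks programmatically via ThreadMXBean – a sketch (the two boolean arguments request the locked monitor and the locked java.util.concurrent synchronizer information; note that ThreadInfo.toString() truncates very deep stacks):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {
    public static void printThreadDump() {
        final ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // true, true = include locked monitors and locked java.util.concurrent synchronizers
        for ( final ThreadInfo info : bean.dumpAllThreads( true, true ) )
            System.out.print( info );
    }
}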

2.5. Running Java Flight Recorder

All tools mentioned up to this point should be used only for quick investigations. For deeper analysis I recommend relying on the built-in Java Flight Recorder. Take a look at my recent article about Java Mission Control for more details.

Running JFR is a 3 step process:

  1. You need to create a JFR template file containing the desired settings. In order to do this, run jmc and go to the Window -> Flight Recording Template Manager menu. Once the profile is ready, export it to a file and then send it to the box you are working on.
  2. JFR requires a JDK commercial license. Take another look at my recent Java Mission Control article for the details; I will assume you are happy with the license terms. Now you need to unlock the commercial features on the required JVM:

    jcmd <PID> VM.unlock_commercial_features
  3. After that you can start JFR. Here is an example of the command line:

    jcmd <PID> JFR.start name=test duration=60s settings=template.jfc filename=output.jfr

    This command starts JFR immediately (the delay property is not set), collects JVM information for 60 seconds using the settings from the template.jfc template file, and writes the results into output.jfr (it makes sense to use absolute paths for both files).

Once the recording is done, you can copy the .jfr file to your laptop and analyze it in the jmc GUI. It contains nearly all information you need to troubleshoot the JVM, with the exception of a full heap dump, which you can make and copy to your box separately.

String switch implementation
by Mikhail Vorontsov

This article covers the implementation details of the String switch introduced in Java 7. It is syntactic sugar on top of the normal switch operator.

Suppose you have the following method:

public int switchTest( final String s )
{
    switch ( s )
    {
        case "a" :
            System.out.println("aa");
            return 11;
        case "b" :
            System.out.println("bb");
            return 22;
        default :
            System.out.println("cc");
            return 33;
    }
}

It is converted by javac into the following code (decompiled back into Java):

public int switchTest(String var1) {
    byte var3 = -1;
    switch(var1.hashCode()) {
    case 97:
        if(var1.equals("a")) {
            var3 = 0;
        }
        break;
    case 98:
        if(var1.equals("b")) {
            var3 = 1;
        }
    }

    switch(var3) {
    case 0:
        System.out.println("aa");
        return 11;
    case 1:
        System.out.println("bb");
        return 22;
    default:
        System.out.println("cc");
        return 33;
    }
}

The generated code consists of 2 parts:

  • Translation from String into a distinct int for each case, which is implemented in the first switch statement.
  • The actual switch based on int-s.

The first switch contains a case for each distinct String.hashCode value among the original String switch labels. After matching by hash code, a string is compared for equality to every label string with the same hash code. It is pretty unlikely that 2 strings used in switch labels will have the same hash code, so in most cases you will end up with exactly one String.equals call.
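For example, the labels "Aa" and "BB" both have hashCode() == 2112, so for a switch containing both of them javac emits a single hash case with chained equals calls – roughly like this sketch (the exact decompiled shape may differ):

public int collisionTest( final String s ) {
    int idx = -1;
    switch ( s.hashCode() ) {
    case 2112: // both "Aa" and "BB" hash to 2112
        if ( s.equals( "BB" ) )
            idx = 1;
        else if ( s.equals( "Aa" ) )
            idx = 0;
    }
    switch ( idx ) {
    case 0:  return 1; // case "Aa"
    case 1:  return 2; // case "BB"
    default: return 3;
    }
}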

After seeing the generated code, it becomes clear why you can not use null as a switch label: the first switch starts by calculating the hashCode of the switch argument.

What can we say about the performance of the underlying int switch? As you can find in one of my earlier articles, a switch is implemented as a fixed map with a table size of approximately 20 (which is fine for most common cases).

Finally, we should note that the String.hashCode implementation has implicitly become part of the Java Language Specification after it was used in the String switch implementation. It can no longer be changed without breaking the .class files containing String switches which were compiled with older versions of Java.
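For reference, the now-frozen algorithm computes s[0]*31^(n-1) + s[1]*31^(n-2) + … + s[n-1], i.e. the equivalent of this loop (a sketch of the specified algorithm, not the actual JDK source, which also caches the result):

public static int stringHashCode( final String s ) {
    int h = 0;
    for ( int i = 0; i < s.length(); ++i )
        h = 31 * h + s.charAt( i ); // h = h * 31 + next char
    return h;
}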

Oracle Java Mission Control Overview
by Mikhail Vorontsov

Introduction

This article will describe Java Mission Control – a JDK GUI tool (jmc / jmc.exe) available since Java 7u40. We will also discuss Java Flight Recorder – a surprisingly good JDK profiler with some features not available in any other profiler. Finally, we will look at JOverflow Analyzer – yet another semi-free tool (free for development, commercial for production), which allows you to analyze a lot of memory usage anti-patterns in your application based on a simple HPROF file.

Java Mission Control

Oracle Java Mission Control is a tool available in the Oracle JDK since Java 7u40. This tool originates from the JRockit JVM, where it had been available for years. JRockit and its version of JMC are well described in the Oracle JRockit: The Definitive Guide book, written by two JRockit senior developers (also visit Marcus Hirt’s blog – the first place you should be looking for any JMC news).

Oracle JMC can be used for 2 main purposes:

  • Monitoring the state of multiple running Oracle JVMs
  • Java Flight Recorder dump file analysis

JMC license

The current JMC license (see “Supplemental license terms” here) allows you to freely use JMC for development, but it requires the purchase of a commercial license if you want to use it in production (this is my personal opinion, I am not a lawyer :) ). This means that you can avoid spending extra dollars if you have a proper QA process :)

JMC plug-ins

JMC offers a few plugins. You can install them via the Help -> Install New Software menu (you may not know the plugins exist and never go there :( ). Note that each plugin may have its own license, so be careful and read the licenses. I will give an overview of the “JOverflow Analysis” plugin in this article – it looks for a list of inefficient memory usage patterns in your app heap.

Realtime process monitoring

You can attach to a JVM by right-clicking on it in the JVM Browser tab of the main window and choosing the “Start JMX Console” menu option. You will see the following screen. There is nothing fancy here; just pay attention to the “+” buttons, which allow you to add more counters to this screen.

Main monitoring screen


Event triggers

Have you noticed the tabs at the bottom of the main screen? That’s where the most interesting features are hiding! The first powerful feature of JMC is event triggers. Triggers allow you to run various actions in response to certain JMX counters exceeding and (optionally) staying above a threshold for a given period of time.

For example, a trigger can write an HPROF memory dump when you are getting close to the memory limit, instead of doing it only on OutOfMemoryError (which has been supported by the standard JVM options for a long time). Or you can trigger a JFR recording in case of sufficiently long high CPU activity, in order to understand which component is causing it (and you are not limited to a single recording!).
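For reference, the standard JVM options mentioned above, which produce a heap dump only at the moment of OutOfMemoryError, are (the path is a placeholder):

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dump.hprof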

Note that triggers work on any JMX counter (do you see the “Add…” button?) – you can set up more triggers than are available in the standard distribution and export the settings to disk. You can even work with your own application’s JMX counters.
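Publishing your own counter is straightforward – a minimal sketch (the QueueStats class, its attribute and the ObjectName are made-up names for illustration):

import java.lang.management.ManagementFactory;
import javax.management.ObjectName;

// the standard MBean naming convention: interface name = class name + "MBean"
interface QueueStatsMBean {
    int getQueueLength();
}

public class QueueStats implements QueueStatsMBean {
    private volatile int queueLength;

    public int getQueueLength() { return queueLength; }
    public void setQueueLength( final int length ) { queueLength = length; }

    public static void main( final String[] args ) throws Exception {
        final QueueStats stats = new QueueStats();
        ManagementFactory.getPlatformMBeanServer().registerMBean(
            stats, new ObjectName( "myapp:type=QueueStats" ) );
        Thread.sleep( Long.MAX_VALUE ); // keep the JVM alive so the counter stays visible
    }
}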

Triggers screen


Go to the “Action” tab in the “Rule Details” window – here you can specify which action you want to execute when the event fires.

Triggers actions

While an HPROF dump and running a JFR recording seem most useful to me, you are definitely not limited to a single command per event – for example, you may want to make a dump and send yourself an email about the event, so you can investigate right away. In this case you need to duplicate a trigger rule – click the “Add…” button and create another rule for the same JMX counter.

Duplicated trigger

Note that you need to run your app on at least Java 7 update 40 if you want to properly use JFR – I was not able to record any events from JREs prior to Java 7u40 (maybe this was a bug or an incompatibility between certain JRE versions…).

Memory tab

The next tab, “Memory”, provides you with summary information about your application heap and garbage collection. Note that you can run a full GC and request a heap dump from this page (highlighted on the screenshot). But in essence this page is just a nice UI around functionality available from other sources.

Memory screen

Threads tab

The Threads tab may seem pretty useless at first glance, but it is actually a hidden treasure. This tab allows you to see the list of running threads in your app with their current stack dumps (updated once a second). It also lets you see:

  • Thread state – running or blocked / waiting
  • Lock name
  • If a thread is deadlocked
  • The number of times a thread was blocked
  • Per thread CPU usage!
  • Amount of memory allocated by a given thread since it was started

As a result, you can see which threads are running, what they are doing, whether they are blocked (and on what lock), and how much CPU and memory load they create. Isn’t that everything you want to quickly find out about your app??? :)

Remember that you have to turn on CPU profiling, deadlock detection and memory allocation tracking to obtain that information in realtime mode:

Threads screen

Using Java Flight Recorder

Java Flight Recorder (we will call it JFR in the rest of this article) is a JMC feature which may well replace your favorite profiler. From the user’s point of view, you run JFR with a fixed recording time / maximum recording file size / maximum recording length (your app can finish before that) and wait until the recording is complete. After that you analyze it in JMC.

How to run JFR

You need to add the following 2 options to the JVM you want to connect to:

-XX:+UnlockCommercialFeatures -XX:+FlightRecorder

This is rather frustrating if you have to connect to an already running JVM. Luckily, JMC 5.5+ (shipped with Java 8u40+) is able to turn on these two parameters on an already running JVM.

Next, if you want to get anything useful from JFR, you need to connect to Java 7u40 or newer. The documentation claims that you can connect to any JVM from Java 7u4 onwards, but I was not able to get any useful information from those JVMs.

The third thing to keep in mind is that by default the JVM allows stack traces to be taken only at safepoints. As a result, you may get incorrect stack trace information in some situations. The JFR documentation tells you to set 2 more parameters if you want more precise stack traces (you will not be able to set those parameters on a running JVM):

-XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints

Finally, if you want as much file I/O, Java exception and CPU profiling info as possible, ensure that the corresponding parameters are enabled and their thresholds are set to “1 ms”:

File/Exceptions JFR settings


Profiling JFR settings

JFR Initial Screen

JFR Initial screen

The initial screen of a JFR recording contains CPU and heap usage charts over the recording period. Treat it just as an overview of your process. The only thing you should note on this (and other JFR screens) is the ability to select a time range to analyze via any chart. Tick the “Synchronize Selection” checkbox to keep the same time range on each window – it will allow you to inspect only the events that happened in this range.

There is one more interesting feature on this screen: the “JVM Information” tab at the bottom contains the values of all JVM parameters set in the profiled JVM. You can also obtain them via the -XX:+PrintFlagsFinal JVM option, but getting them remotely via the UI is more convenient:

All JVM parameters

JFR Memory tab

The memory tab provides you with information about:

  • Machine RAM and Java heap usage (you can easily guess if swapping or excessive GC happened during the recording).
  • Garbage collections – when, why, for how long and how much space was cleaned up.
  • Memory allocation – from/outside TLAB, by class/thread/stack trace.
  • Heap snapshot – the number of instances and amount of memory occupied, per class name

Essentially, this tab allows you to check the memory allocation rate in your app, the amount of pressure it puts on the GC, and which code paths are responsible for an unexpectedly high allocation rate. JFR also has its own very special feature – it allows you to track TLAB and global heap allocations separately (TLAB allocations are much faster, because they do not require any synchronization).

In general, your app will get faster if:

  • It allocates fewer objects (by count and amount of allocated RAM)
  • It has fewer old (full) garbage collections, because they are slower and require stopping the world (at least for some time)
  • It has minimized non-TLAB object allocations

Let’s see how you can monitor this information. An “Overview” tab shows you the general information about memory consumption/allocation/garbage collection.

JFR Memory Overview tab

You can see here how far the “Committed Heap” is from the “Reserved Heap”. It shows how much margin you have in case of input spikes. The blue line (“Used Heap”) shows how much data is leaking / staying in the old generation: if your sawtooth pattern is going up with each step, your old generation is growing. The lowest point of each step approximately shows the amount of data in the old generation (some of it may be eligible for garbage collection). The pattern on the screenshot says that the application is allocating only short-lived objects, which are collected by the young generation GC (it may be some stateless processing).

You can also check the “Allocation rate for TLABs” field – it shows how much memory is being allocated per second (there is another counter called “Allocation rate for objects”, but it should be pretty low in general). 126 Mb/sec (in the example) is a pretty average rate for batch processing (compare it with an HDD read speed), but pretty high for most interactive apps. You can use this number as an indicator for overall object allocation optimizations.

The 3 following tabs – “Garbage Collections”, “GC Times” and “GC Configuration” – are pretty self-evident and can be a source of information about the reasons for GCs and the longest pauses caused by GC (which affect your app latency).

Allocations tab

The “Allocations” tab provides you with information about all object allocations. You should go to the “Allocation in new TLAB” tab. Here you can see the object allocation profiles per class (which classes’ instances are being allocated), per thread (which threads allocate most of the objects) or per call stack (treat it as global allocation information).

Allocations tab

Allocation by Class

Let’s see what you can find out from each of these tabs. The first one (on the screenshot above), “Allocation by Class”, lets you see which classes are allocated most. Select a type in the middle tab and you will get allocation stats (with stack traces) for all allocations of this class’s instances.

The first check you should make here is whether you can find any “useless” object allocations: any primitive wrappers like Integer or Double (which often indicate the use of JDK collections), java.util.Date, java.util.GregorianCalendar, Pattern, any formatters, etc. I have written some memory tuning hints in the second part of my recent article. The “Stack Trace” tab will let you find the code to improve.

Another problem to check is excessive object allocations. Unfortunately, no general advice can be given here – you should use your common sense to understand what “excessive” means in your application. The common issues are useless defensive copying (for read-only clients) and excessive use of String.substring since the String class changed in Java 7u6.
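For example, since the Java 7u6 change String.substring copies the underlying char[] instead of sharing it, so parsing code in this style (a made-up sketch) now allocates a new String and char[] on every call:

// before Java 7u6 this shared the caller's char[]; now every call copies the region
private static String extractKey( final String line ) {
    return line.substring( 0, 8 );
}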

Allocation by Thread

The “Allocation by Thread” tab could be interesting if you have several types of data processing threads in your application (or you can distinguish which tasks are run by which threads) – in this case you can figure out the object allocations per thread:

Allocation by Thread tab

Allocation Profile

If all your threads are uniform (or you have just one data processing thread), or you simply want to see high level allocation information, then go directly to the “Allocation Profile” tab. Here you will see how much memory has been allocated on each call stack across all threads.

Allocation Profile

This view allows you to find the code paths putting the highest pressure on the memory subsystem. You should distinguish between expected and excessive allocations here. For example, if method A calls method B more than once, method B allocates some memory inside it, and all invocations of method B are guaranteed to return the same result – it means you call method B excessively. Another example of excessive method calls / object allocation could be string concatenation in Logger.log calls. Finally, beware of any optimizations which force you to create a pool of reusable objects – you should pool/cache objects only if you have no more than one stored object per thread (the well known example is ThreadLocal<DateFormat>).
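A sketch of that well known example – SimpleDateFormat is not thread-safe, so one instance is cached per thread instead of allocating a new formatter on every call:

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateFormatter {
    private static final ThreadLocal<DateFormat> FORMAT = new ThreadLocal<DateFormat>() {
        @Override
        protected DateFormat initialValue() {
            return new SimpleDateFormat( "yyyy-MM-dd" );
        }
    };

    public static String format( final Date date ) {
        return FORMAT.get().format( date ); // no new formatter allocation per call
    }
}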

JFR Code Tab

The next large tab in the JFR view is the “Code” tab. It is useful for CPU optimization:

JFR Code tab

The overview tab provides you with 2 views: “Hot packages”, where you can see the time spent per Java package, and “Hot classes”, which shows the most CPU expensive classes in your application.

The “Hot packages” view may be useful if you use some 3rd party libs over which you have very little control and you want a CPU usage summary for your code (one package), the 3rd party code (a few other packages) and the JDK (a few more packages). At the same time, I’d call it a “CIO/CTO view”, because it is not interactive and does not let you see which classes from those packages are to blame. As a developer, you’d better use filtering on most of the other tables in this tab:

Filtering by class name pattern in the Code tab

Hot Methods / Call Tree tabs

The “Hot Methods” and “Call Tree” tabs are the ordinary views provided by literally any Java profiler. They show your app’s hot spots – the methods where your application has spent most of its time – as well as the code paths leading to those hot spots. You should generally start your app’s CPU tuning from the “Hot Methods” tab and later check whether the overall picture is sane enough in the “Call Tree” tab.

You should be aware that all “low impact” profilers use sampling to obtain the CPU profile. A sampling profiler takes a stack trace dump of all application threads periodically. The usual sampling period is 10 milliseconds. It is usually not recommended to reduce this period to less than 1 ms, because the sampling impact starts getting noticeable.

As a result, the CPU profile you will see is statistically valid, but not precise. For example, you may be unlucky enough to hit some pretty infrequently called method right at the sampling interval. This happens from time to time… If you suspect that the profiler is showing you incorrect information, try reorganizing the “hot” methods – inline the method into its caller on the hottest path, or on the contrary, try to split the method into 2 parts – it may be enough to remove the method from the profiler view.

Hot methods tab

Exceptions tab

The “Exceptions” tab is the last tab in the “Code” view which is worth attention in the general optimization case. Throwing Java exceptions is very slow, and their usage must be strictly limited to exceptional scenarios in high performance code.
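If you absolutely can not avoid throwing exceptions in hot code, you can at least skip the most expensive part – the stack trace capture – via the protected Throwable constructor available since Java 7 (a sketch; such exceptions are much harder to debug, so use with care):

public class QuickException extends Exception {
    public QuickException( final String message ) {
        // writableStackTrace = false skips the expensive fillInStackTrace() call
        super( message, null, false, false );
    }
}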

The Exceptions view provides you with stats about the number of exceptions thrown during the recording, as well as their stack traces and details. Go through the “Overview” tab and check whether you see:

  • Any unexpected exceptions
  • Unexpected number of expected exceptions

If you see anything suspicious, go to the “Exceptions” tab and check the exception details. Try to get rid of at least the most numerous ones.

Exceptions tab

JFR Threads Tab

The JFR Threads Tab provides you with the following information:

  • CPU usage / Thread count charts
  • Per thread CPU profile – similar to the one on the Code tab, but on per thread basis
  • Contention – which threads were blocked by which threads and for how long
  • Latencies – what caused application threads to go into the waiting state (you will clearly see some JFR overhead here)
  • Lock instances – locks which have caused thread contention

I will not cover this tab in detail in this article, because you need it only for pretty advanced optimizations like lock striping, atomic / volatile variables, non-blocking algorithms and so on.

JFR I/O Tab

The I/O tab should be used for inspecting the file and socket input/output in your application. It lets you see which files your application was processing, what the read/write sizes were, and how long it took to complete each I/O operation. You can also see the order of I/O events in your app.

As with most of the other JFR tabs, you need to interpret the output of this tab yourself. Here are a few example questions you could ask yourself:

  • Do I see any unexpected I/O operations (on files I don’t expect to see here)?
  • Do I open/read/close the same file multiple times?
  • Are the read/write block sizes expected? Aren’t they too small?

Please note that it is highly recommended to reduce the “File Read Threshold” JFR parameter (you can set it up while starting the JFR recording) to 1 ms if you are using an SSD. You may miss too many I/O events on an SSD with the default 10 ms threshold:

Set a smaller File Threshold for SSDs

The I/O “Overview” tab is great, but it does not provide any extra information compared to the following 4 specialized tabs. Each of the 4 specialized tabs (file read/write, socket read/write) is similar to the others, so let’s look at just one of them – “File Read”.

By File Tab

There are 3 tabs here: “By File”, “By Thread” and “By Event”. The first 2 tabs group operations by file and by thread. The last tab simply lists all I/O events, but it may be pretty useful if you are investigating which operations were made on a particular file (filter by “Path”) or if you want to figure out whether you have made read requests for short chunks of data (sort by “Bytes Read”), which hurts application performance. In general, you should always buffer disk reads, so that only the read of the file tail is shorter than the buffer size.
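A minimal sketch of properly buffered reads – with a BufferedInputStream only the read of the file tail will be shorter than the buffer size:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedRead {
    public static long countBytes( final String path ) throws IOException {
        long total = 0;
        // 64K buffer: the underlying file reads happen in 64K chunks
        try ( InputStream is = new BufferedInputStream( new FileInputStream( path ), 64 * 1024 ) ) {
            final byte[] chunk = new byte[ 4096 ];
            int read;
            while ( ( read = is.read( chunk ) ) != -1 )
                total += read;
        }
        return total;
    }
}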

Note that the I/O information is collected via sampling as well, so some (or a lot) of the file operations will be missing from the “I/O” tab. This can be especially noticeable on top range SSDs.

There is one more related screen which allows you to group I/O (and some other) events by various fields. For example, you may want to find out how many read operations read a given number of bytes (and check their stack traces). Go to the “Events” tab on the left of the JFR view and then to the very last tab, called “Histogram”.

Here you can filter/sort/group various events by the available columns. Each JFR event has a related stack trace, so you can see the stack trace information for the selected events:

Sorting/Filtering/Grouping on the Events/Histogram tab

There is one basic performance tuning area not covered by JFR: memory usage antipatterns, like duplicate strings or nearly empty collections with a huge capacity. JFR does not provide you with such information because you need a heap dump for such an analysis. That’s where the JMC plugin called “JOverflow Analysis” comes in.

JOverflow Analysis

As I have written above, the main and only purpose of “JOverflow Analysis” is to provide you with information about inefficient memory usage in your application. You can run it via the “Dump Heap” drop down menu in the JMC “JVM Browser”, or (you may not realize it) you can just open an HPROF file via the “Open file” menu! If you remember the Triggers section of the JMX Console earlier in this article, one of the available actions was “HPROF Dump” – this is yet another way to obtain an HPROF file.

Just for your reference: the original way to generate an HPROF file is to run the jmap JDK tool. Start it without parameters to see the command line options. Here is an example of a command which makes a heap dump of the process with id = 2976:

jmap -dump:format=b,live,file=your_file_name 2976

By the way, this is one of the rare JMC features which does not require a recent JVM in the client process. You can make an HPROF dump on an older JVM and process it in a modern JMC.

JOverflow plug-in main page

After opening the HPROF file (which may take a pretty long time and require a lot of CPU power for multi gigabyte heap dumps), you will see the JOverflow main (and only) screen. The top-left table contains all found memory issues (I have used a tiny test application for this example – you will see more patterns on complex heaps). Also pay attention to the “Reset” button in the top right corner – you will use it pretty often to reset the view.

The usage of this screen is slightly unintuitive at first glance and can be a little frustrating… Each table on this screen is interactive, but I advise you not to use the top-right table for selection – you can revert its selection only via the “Reset” button in the top-right corner.

Actions for the initial view

Let’s see what happens if you click those tables from the initial state of the window.

If you click on the top-left table, which lists the memory anti-patterns, you will select all objects matching this anti-pattern in the other tables. Clicking on the top-right table will leave only instances of classes referenced by the given class’s instances. This view will also show paths to GC roots. Unfortunately, you can not reset the selection from this table; you have to press the “Reset” button.
Clicking in the “Class” table will leave only those anti-patterns which were detected for the given class’s instances. You can reset the selection by clicking on the button which appears next to the “Class” table:

How to reset Class selection

Clicking on the bottom-right “Ancestor referrer” table has the same effect as clicking on the top-right table – it selects all objects referred to from the instances of the given class. Luckily, this view can be independently reset by a button appearing next to this table:

How to reset Ancestor referrer selection

Clicking on the issues in the top-left table will show you the class names in the bottom-left table, except in 2 cases: “Duplicate strings” and “Duplicate arrays”. In these cases the bottom-left table is renamed to “Duplicate” and shows you the actual duplicated string/array contents. Nevertheless, the working principles of this window do not change for these 2 special cases.

Duplicate Strings table contents

Fixing the JOverflow memory issues

This final section provides a brief overview of fixing some of the problems shown by JOverflow Analyzer.

Arrays with one element
  Overhead: 28 bytes (4 – array reference, 24 – array contents)
  Solution: If your API requires array usage, check if you can reuse some of them. Change the API, if possible, to accept single elements.

Arrays with underused elements
  Overhead: element size * number of unused elements
  Solution:
  • If you are using any primitive collection libraries, these arrays are the result of underutilised map capacity. Sometimes you can increase the map fill factor, sometimes you can do nothing.
  • You may be using these arrays directly from your program. Check if some of them were allocated with a too generous size.

Boxed collections
  Overhead: depends on the collection type
  Solution: I have covered the overhead of JDK boxed collections here and here. There is no excuse for using the basic JDK boxed collections like ArrayList or HashMap/HashSet when there are so many alternatives available.

Duplicate arrays
  Overhead: depends on the duplicated array size
  Solution: Duplicate arrays are often a sign of duplicated higher level objects (for example, you have loaded the same bitmap twice). I have also seen cases when an application ended up with a lot of duplicate arrays in some read-only structures – some sort of canonicalization after a structure is built will help you to get rid of the excessive copies.

Duplicate strings
  Overhead: a string occupies 40+len*2 bytes; the more duplicates you have, the worse.
  Solution: This is usually the first memory usage issue I look at. Check the paths to GC roots to see where the duplicate strings are stored. Then consider interning them. The few lucky ones using Java 8u20 or newer can try string deduplication as an easier (but less efficient) alternative.

Empty arrays
  Overhead: element size * number of unused elements
  Solution: These are arrays where no element has a non-default value. Most likely caused by an unused field / unused parent object. See if you can lazily initialize the field.

Empty unused collections
  Overhead: see the overhead of boxed collections above.
  Solution: I expect that most of the unused collections in your application will be of 2 types: ArrayList and HashMap. This pair of classes was optimized for this “scenario” in Java 7u40. Keep in mind the JDK Collections.empty* methods when dealing with this issue – all these methods return a singleton object.

Small collections
  Overhead: the overhead of collection storage (see Boxed collections above)
  Solution: Use the JDK Collections.singleton* methods for read-only single element collections. Check that you do not allocate a too generous capacity for maps of 2+ elements.

Zero size arrays
  Overhead: 16 bytes per instance
  Solution: Declare a constant for any empty arrays you use. There is no excuse for allocating arrays of zero size multiple times.
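A few of these fixes in code form (a sketch):

import java.util.Collections;
import java.util.List;
import java.util.Set;

public class CollectionFixes {
    // "Zero size arrays": declare the empty array once and reuse it
    public static final String[] EMPTY_STRING_ARRAY = new String[ 0 ];

    // "Empty unused collections": a singleton instead of a fresh ArrayList per call
    public static List<String> noResults() {
        return Collections.emptyList();
    }

    // "Small collections": a read-only single element set without HashSet overhead
    public static Set<String> oneResult( final String value ) {
        return Collections.singleton( value );
    }
}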

Implementing a world fastest Java int-to-int hash map*
by Mikhail Vorontsov

* Fastest among the int-to-int map implementations I tested in my previous article, using the tests implemented for that article.

I would like to thank Sebastiano Vigna and Roman Leventov for sharing their hash map wisdom with me. Some implementation ideas were inspired by “Code Optimization: Effective Memory Usage” by Kris Kaspersky.

This article will give you a step by step overview of various implementation tricks used in modern hash map implementations. At the end of this article you will have what is probably the fastest Java int-to-int hash map implementation available at the moment of writing.

Open indexing

Most modern hash maps are based on the idea of open indexing. What does it mean? Your map is based on an array of keys (values are always placed at the matching array index, so forget about them for now). You have to find your key in the array of keys for each map operation. How is it implemented?

First of all, you need the initial lookup position in the array. It may be calculated by any function which maps a key into an integer in the range [0, array.length - 1]. A key is usually mapped into an integer by means of its hashCode method. The simplest function here could be Math.abs(key.hashCode() % array.length) (keep in mind that the % result can be negative).

As you understand, mapping a large set of keys into a small set of integer values means that you may end up with some collisions (called hash collisions) – identical results of the initial function for different keys. Collisions are resolved by applying another function to the original array index. The simplest such function is (prevIdx + 1) % array.length. There is one requirement for such functions – if applied in a loop, they should cover the whole set of array cells, so that you can use the whole array capacity. Another example of such a function is incrementing the index by one prime number when the array length is another prime number.

Free and removed cells

In theory, that’s enough to implement your own hash map. In practice, you need to distinguish free and removed cells from occupied cells (you can avoid using removed cells if you do extra work in the remove method – see how it is implemented in the latest FastUtil). Removed cells are also known as “tombstones”.

Your keys array is initially filled with free “cells”. You set a cell into “removed” state if you need to remove an existing key.

Let’s take a look at an example:

Open indexing example

This int key map uses the initial and next functions defined above:

initial = Math.abs( key % array.length );
nextIdx = ( prevIdx + 1 ) % array.length;

This map originally contained keys 1, 2, 3 and 4, but key=3 was subsequently removed from the map, so it was replaced with a removed (“R”) placeholder.

Let’s see what we should do to find the following keys:

Key 2: The start function points at the cell with index=2 at once. We have key=2 at the cell with index=2, so no further lookup is required.

Key 3: The start function points at the cell with index=3. This cell is “removed”, so we have to apply the “nextIdx” function in a loop until we either find the key or a free cell. We check the cell at index=4 next – bad luck, the key is not equal. Then we check the cell at index=5: it is a free cell, so we stop the lookup – the key is not found.

Next, let’s see what happens if we want to add key=10: initial = key % array.length = 10 % 9 = 1. The cell at index=1 is already occupied by another key, so we can not use it. So is the cell at index=2. The cell at index=3 is “removed”, so we can reuse it and put key=10 into it.

Removed cells cleanup

In many cases your hash map may degrade to O(n^2) complexity if you keep the removed cells in the map. The fastest maps implement removed cell cleanup one way or another. As a result, all other map methods only need to distinguish 2 cell states: free or used. Besides that, remove is usually called infrequently compared to get, and less frequently than put, which means that some extra complexity during key removal is paid off by the faster execution of the other methods. This article will use the FastUtil cleanup logic.

Key scrambling

The initial index function I mentioned above ( initial = Math.abs( key % array.length ); ) puts consecutive keys in consecutive array cells. This is highly undesirable if your next cell function simply picks the next array cell, because it causes long lookup chains to be created in a pretty common case.

In order to avoid this, we need to “scramble” the key, shuffling its bits. I will rely on the FastUtil scrambling code:

private static final int INT_PHI = 0x9E3779B9;

public static int phiMix( final int x ) {
    final int h = x * INT_PHI;
    return h ^ (h >> 16);
}

As a result, consecutive keys will not be placed in consecutive array cells, thus keeping the average hash chain length under control. As for the “random” keys case, you are likely to end up with a pretty good distribution of keys over the keys array as well.
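You can easily observe the scrambling effect on consecutive keys with a tiny test (a sketch which assumes the phiMix method above lives in the article’s Tools helper class; the exact output values do not matter – the point is that they are no longer adjacent):

public class PhiMixDemo {
    public static void main( final String[] args ) {
        final int mask = 16 - 1; // a map with a 16 cell array
        for ( int key = 1; key <= 4; ++key )
            System.out.println( key + " -> " + ( Tools.phiMix( key ) & mask ) );
        // without scrambling, keys 1..4 would occupy 4 consecutive cells
    }
}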

Now you are definitely ready to implement your own hash map. We will be implementing an int-int map in the next several sections of this article.


Version 1: Base int-int map

Let’s start from the simplest possible implementation (which will leave us enough space for optimization). This implementation will look similar to the Trove 3.0 TIntIntHashMap (though I was not copying its source code while writing this example).

It will use 3 arrays: one int[] for keys, one int[] for values and one boolean[] for cell usage flags (true == used). We will allocate arrays of the requested size (which will be size / fillFactor + 1), despite the fact that all production quality implementations round this number up. The initial and nextIdx functions are simplicity itself:

initial = Math.abs( Tools.phiMix(key) % array.length);
nextIdx = ( prevIdx + 1 ) % array.length;

You can find the source code of this and all other maps at the end of this article.

I will compare each implementation’s results to the previous article’s results. Let’s compare it to the fastest implementation – Koloboke (note that I will give only the get test results for the intermediate versions of the map to save space – you can find the full test results at the end of the article). All tests for this article are run on the random key set. All maps have the fill factor set to 0.75.

Map size: 10.000 100.000 1.000.000 10.000.000 100.000.000
tests.maptests.primitive.KolobokeMutableMapTest 1867 2471 3129 7546 11191
tests.maptests.article_examples.IntIntMap1Test 2768 3671 6105 12313 16073

That’s already great! The very first unoptimized attempt to implement a hash map is less than 2 times slower than the Koloboke implementation.

Version 2: avoiding the expensive % operation – array capacity is a power of 2 now

Surprisingly, many people think that the old wisdom about extremely slow integer division/modulo operations is no longer valid – new CPUs are so smart and fast!!! Unfortunately, that wisdom still holds – both integer division operations are pretty slow and should be avoided in performance critical code.

The previous version of the hash map used the % operation for both the start and next index calculations, so it was guaranteed to be executed at least once per hash map method call. We can avoid it if our array size is a power of 2 (the first power of 2 higher than the expected capacity). In this case our simple index functions can be transformed into:

initial = Tools.phiMix( key ) & (array.length-1);
nextIdx = ( prevIdx + 1 ) & (array.length-1);

array.length-1 should be, of course, cached in a separate mask class field. Why can array.length-1 be used as a mask? It is a known fact that if K = 2^N, then X % K == X & (K - 1) for non-negative X. Using the & operation gives us the extra benefit of always non-negative results (the highest bit is always cleared by such masks), which further speeds up the calculation.

Keep in mind that all high performance hash maps rely on this optimization.
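A common way to compute such a power-of-2 capacity (a sketch; production implementations contain an equivalent helper):

public static int arraySize( final int expectedSize, final float fillFactor ) {
    final int required = (int) ( expectedSize / fillFactor ) + 1;
    // round up to the first power of 2 >= required (assumes required <= 2^30)
    final int size = Integer.highestOneBit( required - 1 ) << 1;
    return Math.max( size, 2 );
}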

Let’s compare this implementation’s results with the previous maps:

Map size: 10.000 100.000 1.000.000 10.000.000 100.000.000
tests.maptests.primitive.KolobokeMutableMapTest 1867 2471 3129 7546 11191
tests.maptests.article_examples.IntIntMap1Test 2768 3671 6105 12313 16073
tests.maptests.article_examples.IntIntMap2Test 2254 2767 4869 10543 16724

This optimization alone allowed us to cover half the distance from the slowest to the fastest implementation! Nevertheless, there is a long way ahead of us.

Version 3: getting rid of m_used array

The previous version of the map used 3 separate arrays to store the map data. It means that for a random key the map has to access 3 different areas of memory, each likely causing a CPU cache miss. High performance code should minimize the number of CPU cache misses it causes on the normal operation path. The simplest optimization we can do is to get rid of the m_used array and encode the cell usage flag in a smarter way.

The problem is that we are implementing an int-to-int map, so we may expect any int to be used as a key (how pathetic would a general purpose hash map be if it stated that some key N is reserved and can not be used…). It means that we need some extra storage for the usage flags, doesn’t it? Yes, it does. But the point is that we can use O(1) instead of O(n) bytes for this storage!

The idea is to choose one special key value for free cells. There are 2 known strategies after that (I will use the first one):

  1. Store the value matching the free cell in a separate field. You also need some indication of whether the free key is actually used (a boolean and an int, or just a Boolean, is fine for this purpose). Each map method should check whether the key argument is equal to the free key prior to the normal logic and act accordingly.
  2. Pick a random free key. If you are requested to insert it into the map, select a new random free key which is not yet present in the map (and replace the old free key with the new value everywhere). Now you need to store the free key itself instead of its corresponding value, but you do not get any other benefits. Koloboke is the only implementation which uses this strategy.

As a side note, I want to mention that dealing with free keys in maps with Object keys is much easier:

private static final Object FREE_KEY = new Object();

This key is not accessible to other classes, so it can never be passed into your map implementation as a key. Those smart guys who remember about reflection are reminded that hash map keys must properly implement the equals and hashCode methods, which is not the case for this private object :)

As I have mentioned before, our implementation will use a hardcoded free key (0, which is the cheapest constant to compare with) and store the corresponding value in separate fields:

private static final int FREE_KEY = 0;

/** Keys */
private int[] m_keys;
/** Values */
private int[] m_values;

/** Do we have 'free' key in the map? */
private boolean m_hasFreeKey;
/** Value of 'free' key */
private int m_freeValue;

Let’s compare this implementation’s results with the previous maps:

Map size: 10.000 100.000 1.000.000 10.000.000 100.000.000
tests.maptests.primitive.KolobokeMutableMapTest 1867 2471 3129 7546 11191
tests.maptests.article_examples.IntIntMap1Test 2768 3671 6105 12313 16073
tests.maptests.article_examples.IntIntMap2Test 2254 2767 4869 10543 16724
tests.maptests.article_examples.IntIntMap3Test 2050 2269 3548 9074 13750

As you can see, the impact of this change grows with the map size (the bigger your map is, the less help you get from the CPU cache). Nevertheless, we are still pretty far from the Koloboke implementation. But this map would actually already have placed third in my previous article, after Koloboke and FastUtil.

Versions 4 and 4a – replacing arrays of keys and values with a single array

This step follows the direction of the previous step – now we want to use a single array to store both keys and values. This will allow us to access/modify values at a very low cost, because they will be located next to the keys.

There are 2 possible implementations in the case of an int-to-int map:

  1. Use a long[] – a key and a value will share a single long cell. The usefulness of this method is limited to certain types of keys and values.
  2. Use a single int[] – keys and values will be interleaved (the only meaningful interleaving strategy is storing a value right after its key). This strategy has the small disadvantage of limiting the maximal map capacity to 1 billion cells (the maximal array size in Java is equal to Integer.MAX_VALUE). I believe this is not a problem for most use cases.

The difference between those 2 scenarios is the need to use bit arithmetic / type conversions to extract a key and a value from a long cell. My tests have shown that these operations have a noticeable negative impact on map performance. Nevertheless, I have included both the long[] (IntIntMap4) and int[] (IntIntMap4a) versions in this article.
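For illustration, the bit arithmetic required by the long[] variant looks roughly like this (a sketch) – these extra operations on every access are exactly what makes the long[] version slower than the interleaved int[] one:

// pack a key and a value into a single long cell
private static long cell( final int key, final int value ) {
    return ( (long) value << 32 ) | ( key & 0xFFFFFFFFL );
}

private static int key( final long cell ) {
    return (int) cell;              // lower 32 bits
}

private static int value( final long cell ) {
    return (int) ( cell >>> 32 );   // upper 32 bits
}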

Micro optimizations

Both versions will be pretty fast, but you need more to become the fastest. The important thing you should understand about a hash map is that its basic operations have O(1) complexity, provided that you were not too greedy with the fill factor. It actually means that you must count the instructions on the hash hit path (when the very first cell you check is either free or contains the requested key). Optimizing the hash collision loop is also important, but (I repeat this) you must be very careful on the hash hit path, because most of your operations will end up with a hash hit.

With this in mind, you may want to inline some of the helper methods, especially the ones which could save you an instruction or two when inlined. Take a look, for example, at the get method from the previous map version:

public int get( final int key )
{
    if ( key == FREE_KEY )
        return m_hasFreeKey ? m_freeValue : NO_VALUE;

    final int idx = getReadIndex( key );
    return idx != -1 ? m_values[ idx ] : NO_VALUE;
}

private int getReadIndex( final int key )
{
    int idx = getStartIndex( key );
    if ( m_keys[ idx ] == key ) //we check FREE prior to this call
        return idx;
    if ( m_keys[ idx ] == FREE_KEY ) //end of chain already
        return -1;
    final int startIdx = idx;
    while (( idx = getNextIndex( idx ) ) != startIdx )
    {
        if ( m_keys[ idx ] == FREE_KEY )
            return -1;
        if ( m_keys[ idx ] == key )
            return idx;
    }
    return -1;
}

The first check (key == FREE_KEY) is unlikely to succeed, so it will be predicted correctly by the CPU in most cases. Essentially, it can be excluded from consideration here.

Next you get the cell index from the getReadIndex helper method. It is perfectly object-oriented, but it forces an extra check ( return idx != -1 ? m_values[ idx ] : NO_VALUE ). This check, unlike the previous ones, is unpredictable unless your map is essentially read-only and most of your get calls are made for existing keys. Unpredictable checks should be avoided on the critical path (if possible).

This is easy to achieve in this case – just inline the body of getReadIndex into get:

public int get( final int key )
{
    if ( key == FREE_KEY )
        return m_hasFreeKey ? m_freeValue : NO_VALUE;

    int idx = getStartIndex( key );
    if ( m_keys[ idx ] == key ) //we check FREE prior to this call
        return m_values[ idx ];
    if ( m_keys[ idx ] == FREE_KEY ) //end of chain already
        return NO_VALUE;
    final int startIdx = idx;
    while (( idx = getNextIndex( idx ) ) != startIdx )
    {
        if ( m_keys[ idx ] == FREE_KEY )
            return NO_VALUE;
        if ( m_keys[ idx ] == key )
            return m_values[ idx ];
    }
    return NO_VALUE;
}

As you can see, the only transformation I made was replacing the getReadIndex return values with the corresponding get return values. You can make a few similar transformations here and there. It is also worth manually inlining the getStartIndex and getNextIndex calls – you can't tell for sure whether they will be inlined by the JIT. As a result, you may end up with the following get method (taken from IntIntMap4a – the int[] version):

public int get( final int key )
{
    int ptr = ( Tools.phiMix( key ) & m_mask ) << 1;

    if ( key == FREE_KEY )
        return m_hasFreeKey ? m_freeValue : NO_VALUE;

    int k = m_data[ ptr ];

    if ( k == FREE_KEY )
        return NO_VALUE;  //end of chain already
    if ( k == key ) //we check FREE prior to this call
        return m_data[ ptr + 1 ];

    while ( true )
    {
        ptr = ( ptr + 2 ) & m_mask2; //that's the next index
        k = m_data[ ptr ];
        if ( k == FREE_KEY )
            return NO_VALUE;
        if ( k == key )
            return m_data[ ptr + 1 ];
    }
}

As you can see, we have just the following operations on the hash hit path:

  • A check for the free key – highly predictable => extremely cheap
  • 4 bit operations and a multiplication to calculate the start index
  • Extraction of a key – the only truly expensive operation; it can't be avoided
  • Checks for the free key / our key – your scenario may cause these checks to be predicted correctly, but they are not predictable in the general case. In any case, the arguments to compare will reside at worst in the L1 cache, so both checks are cheap.
  • If the key is found – extraction of a value. Our data layout ensures that a value is fetched into the L1 cache together with its key, so reading the value is very cheap.

The hash hit path costs us:

  • 2 highly predictable comparisons with constants
  • 2 unpredictable comparisons with a constant / a key which is most likely already loaded into a register
  • 4 bit operations (all arguments are either constants or values located no further away than the L1 cache)
  • 1 multiplication (unfortunately unavoidable if you need to shuffle the bits)
  • 1 memory read which is likely to cause a CPU cache miss (loading a key to check)
  • 1 memory read which will be served out of the L1 cache (right after the previous memory read)

Only 11 operations to get a value by a key (and only one of them is likely to be expensive) – no wonder the test is able to fetch 10 million keys per second even in the worst case (when a map is too large for the CPU cache to help us and we are also likely to check a few extra cells due to a large fill factor).
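
The "bit operations and a multiplication" above refer to the key scrambling and masking in the start index calculation. A Fibonacci-hashing style scrambler in the spirit of Tools.phiMix could look like this (a sketch – the exact constant and shift used by the project are an assumption here):

// A phiMix-style scrambler: multiply by 2^32 / golden ratio, then mix the high bits down.
private static int phiMix( final int x )
{
    final int h = x * 0x9E3779B9; //a valid int literal, equals -1640531527
    return h ^ ( h >> 16 );
}

// Start index for the interleaved layout (capacity is a power of 2):
// int ptr = ( phiMix( key ) & m_mask ) << 1;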

Test results

Now we will present the test results for all 5 map implementations and a Koloboke map as the fastest int-to-int map from my previous article.

"get" test results

Map size:                                          10.000   100.000   1.000.000   10.000.000   100.000.000
tests.maptests.primitive.KolobokeMutableMapTest      1867      2471        3129         7546         11191
tests.maptests.article_examples.IntIntMap1Test       2768      3671        6105        12313         16073
tests.maptests.article_examples.IntIntMap2Test       2254      2767        4869        10543         16724
tests.maptests.article_examples.IntIntMap3Test       2050      2269        3548         9074         13750
tests.maptests.article_examples.IntIntMap4Test       1902      2296        3229         7749         10661
tests.maptests.article_examples.IntIntMap4aTest      1738      2209        2927         6582          9969
int-int 'get' results

"put" test results

Map size:                                          10.000   100.000   1.000.000   10.000.000   100.000.000
tests.maptests.primitive.KolobokeMutableMapTest      3262      4549        5600        12182         16140
tests.maptests.article_examples.IntIntMap1Test       9048     10555       11322        23004         28974
tests.maptests.article_examples.IntIntMap2Test       4305      4816        9435        19805         26770
tests.maptests.article_examples.IntIntMap3Test       3865      4063        7455        15274         21014
tests.maptests.article_examples.IntIntMap4Test       3562      4676        5866        12999         16304
tests.maptests.article_examples.IntIntMap4aTest      3411      4401        5374        11289         15095
int-int 'put' results

"remove" test results

Map size:                                          10.000   100.000   1.000.000   10.000.000   100.000.000
tests.maptests.article_examples.IntIntMap1Test       8301      9142        9313        16507         24134
tests.maptests.article_examples.IntIntMap2Test       3915      3890        6227        13450         20456
tests.maptests.article_examples.IntIntMap3Test       3339      3270        4901        10425         16120
tests.maptests.article_examples.IntIntMap4Test       3098      3052        3988         8670         12019
tests.maptests.article_examples.IntIntMap4aTest      2870      2876        3823         8127         11503
tests.maptests.primitive.KolobokeMutableMapTest      2836      3042        3923         8228         12007
int-int 'remove' results

As you can see, the results are pretty clear – IntIntMap4a is faster than the Koloboke int-int map, which was the fastest map in my previous article.

Summary

If you want to optimize your hash map for speed, you should do as much of the following as possible:

  • Use underlying array(s) with a capacity equal to a power of 2 – it allows you to use cheap & instead of expensive % for array index calculations (see the sketch after this list).
  • Do not store the cell state in a separate array – use dedicated fields for the free/removed keys and values.
  • Interleave keys and values in a single array – it allows you to load a value into memory essentially for free.
  • Implement a strategy to get rid of 'removed' cells – you can sacrifice some remove performance in favor of more frequent get/put.
  • Scramble the keys while calculating the initial cell index – this is required to deal with sets of consecutive keys.
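
A minimal sketch of the first point – with a power-of-2 capacity, the index calculation becomes a single AND:

private static int index( final int hash, final int capacity )
{
    //capacity must be a power of 2, so (capacity - 1) is an all-ones bit mask;
    //this replaces the expensive Math.abs( hash % capacity ) and is always non-negative
    return hash & ( capacity - 1 );
}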

Source code

Classes written for this article have been added to my previous hash map testing project on GitHub: https://github.com/mikvor/hashmapTest.

Please note that you should run this project via the tests.MapTestRunner class:

mvn clean install
java -cp target/benchmarks.jar tests.MapTestRunner

Large HashMap overview: JDK, FastUtil, Goldman Sachs, HPPC, Koloboke, Trove – January 2015 version
http://java-performance.info/hashmap-overview-jdk-fastutil-goldman-sachs-hppc-koloboke-trove-january-2015/
Fri, 06 Feb 2015 13:00:40 +0000

by Mikhail Vorontsov

This is a major update of the previous version of this article. The reasons for this update are:

  • The major performance updates in fastutil 6.6.0
  • Updates in the “get” test from the original article, addition of “put/update” and “put/remove” tests
  • Adding identity maps to all tests
  • Different objects are now used for all operations after map population (in the case of Object keys – except identity maps). The old approach of reusing the same key objects gave an unfair advantage to Koloboke.

I would like to thank Sebastiano Vigna for providing the initial versions of “get” and “put” tests.

Introduction

This article will give you an overview of hash map implementations in 5 well-known libraries, with the JDK HashMap as a baseline. We will test separately:

  • Primitive to primitive maps
  • Primitive to object maps
  • Object to primitive maps
  • Object to Object maps
  • Object (identity) to Object maps

This article will provide you with the results of 3 tests:

  • “Get” test: Populate a map with a pregenerated set of keys (in the JMH setup), then make ~50% successful and ~50% unsuccessful “get” calls. For non-identity maps with object keys we use a distinct set of keys (a different object with the same value is used for successful “get” calls).
  • “Put/update” test: Add a pregenerated set of keys to the map. In the second loop, add an equal set of keys (different objects with the same values) to this map again (making updates). Identical keys are used for identity maps and for maps with primitive keys.
  • “Put/remove” test: In a loop: add 2 entries to a map, then remove 1 of the existing entries (the “add” pointer is increased by 2 on each iteration, the “remove” pointer is increased by 1) – see the sketch after this list.
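
For clarity, here is a sketch of the “put/remove” test loop described above (map and keys are illustrative names, not the actual benchmark code):

// 2 puts per 1 remove: the map grows by one entry per iteration.
int add = 0, remove = 0;
while ( add < keys.length - 1 )
{
    map.put( keys[ add++ ], add );
    map.put( keys[ add++ ], add );
    map.remove( keys[ remove++ ] );
}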

This article will just give you the test results. There will be a follow-up article on the most interesting implementation details of the various hash maps.

Test participants

JDK 8

JDK HashMap is the oldest hash map implementation in this test. It got a couple of major updates recently – a shared underlying storage for empty maps in Java 7u40 and the ability to convert the underlying hash bucket linked lists into tree maps (for better worst-case performance) in Java 8.

FastUtil 6.6.0

FastUtil provides a developer with all 4 options listed above (all combinations of primitives and objects). Besides that, there are several other types of maps available for each parameter type combination: array map, AVL tree map and RB tree map. Nevertheless, we are only interested in hash maps in this article.

Goldman Sachs Collections 5.1.0

Goldman Sachs open-sourced its collections library about 3 years ago. In my opinion, this library provides the widest range of collections out of the box (if you need them). You should definitely pay attention to it if you need more than a hash map, a tree map and a list for your work :) For the purposes of this article, GS collections provide normal, synchronized and unmodifiable versions of each hash map. The last 2 are just facades for the normal map, so they don’t provide any performance advantages.

HPPC 0.6.1

HPPC provides array lists, array deques, hash sets and hash maps for all primitive types. HPPC provides normal hash maps for primitive keys and both normal and identity hash maps for object keys.

Koloboke 0.6.5

Koloboke is the youngest of all the libraries in this article. It is developed as a part of the OpenHFT project by Roman Leventov. This library currently provides hash maps and hash sets for all primitive/object combinations. The library was recently renamed from HFTC, so some artifacts in my tests still use the old library name.

Trove 3.0.3

Trove has been available for a long time and is quite stable. Unfortunately, not much development is happening in this project at the moment. Trove provides list, stack, queue, hash set and map implementations for all primitive/object combinations. I have already written about Trove.

Data storage implementations and tests

This article will look at 5 different sorts of maps:

  1. int-int
  2. int-Integer
  3. Integer-int
  4. Integer-Integer
  5. Integer (identity map)-Integer

We will use JMH 1.0 for testing. Here is the test description: for each map size in (10K, 100K, 1M, 10M, 100M) (outer loop), generate a set of random keys (they will be used for every test at a given map size) and then run a test for each map implementation (inner loop). Each test will be run 100M / map_size times. The “get”, “put” and “remove” tests are run separately, so you can update the test source code and run only some of them.

Note that each test suite takes around 7-8 hours on my box. Spreadsheet-friendly results will be printed to stdout once all test suites finish.

int-int

Each section will start with a table showing how data is stored inside each map. Only arrays will be shown here (some maps have special fields for a few corner cases).

tests.maptests.primitive.FastUtilMapTest     int[] key, int[] value
tests.maptests.primitive.GsMutableMapTest    int[] keys, int[] values
tests.maptests.primitive.HftcMutableMapTest  long[] (key-low bits, value-high bits)
tests.maptests.primitive.HppcMapTest         int[] keys, int[] values, boolean[] allocated
tests.maptests.primitive.TroveMapTest        int[] _set, int[] _values, byte[] _states

As you can see, Koloboke uses a single array, FastUtil and GS use 2 arrays, and HPPC and Trove use 3 arrays to store the same data. Let’s see what the actual performance is.

“Get” test results

All “get” tests make around 50% unsuccessful get calls in order to test both the success and failure paths in each map.

Each test results section contains a results graph. The X axis shows the map size, the Y axis – the time to run a test in milliseconds. Note that each test in a graph makes a fixed number of map method calls: 100M get calls for the “get” test; 200M put calls for the “put” test; 100M put and 50M remove calls for the “remove” test.

Links to OpenOffice spreadsheets with all test results can be found at the end of this article.

int-int 'get' test results

The GS and FastUtil result lines are nearly parallel, but FastUtil is faster due to a lower constant factor. Koloboke becomes the fastest only on large enough maps. Trove is slower than the other implementations at every map size.

“Put” test results

“Put” tests insert all keys into a map and then use another, equal set of keys to insert entries into the map again (these method calls update the existing entries). We make 100M put calls with “insert” functionality and 100M put calls with “update” functionality in each test.

int-int 'put' test results

This test shows the implementation differences more clearly: Koloboke is the fastest from the start (though FastUtil is just as fast on small maps); GS and FastUtil are parallel again (but GS is always slower). HPPC and Trove are the slowest.

“Remove” test results

In the “remove” test we interleave 2 put operations with 1 remove operation, so that the map size grows by 1 after each group of put/remove calls. In total we make 100M put and 50M remove calls.

int-int 'remove' test results

The results are similar to the “put” test (of course, both tests make a majority of put calls!): Koloboke quickly becomes the fastest implementation; FastUtil is a bit faster than GS on all map sizes; HPPC and Trove are the slowest, though HPPC performs reasonably well on map sizes up to 1M entries.

int-int summary

The underlying storage implementation is the most important factor defining hash map performance: the fewer memory accesses an implementation makes to reach an entry (especially for large maps, which do not fit into the CPU cache), the faster it is. As you can see, the single-array Koloboke is faster than the other implementations in most tests on large map sizes. For smaller map sizes, the CPU cache starts hiding the cost of accessing several arrays – in this case other implementations may be faster due to fewer CPU instructions required per method call: FastUtil is the second best implementation for primitive collection tests due to its highly optimized code.


int-Object

tests.maptests.prim_object.FastUtilIntObjectMapTest  int[] key, Object[] value
tests.maptests.prim_object.GsIntObjectMapTest        int[] keys, Object[] values
tests.maptests.prim_object.HftcIntObjectMapTest      int[] keys, Object[] values
tests.maptests.prim_object.HppcIntObjectMapTest      int[] keys, Object[] values, boolean[] allocated
tests.maptests.prim_object.TroveIntObjectMapTest     int[] _set, Object[] _values, byte[] _states

There are 2 groups here: FastUtil, GS and Koloboke are using 2 arrays; HPPC and Trove are using 3 arrays.

“Get” test results

int-Object 'get' test results

As you can see, FastUtil and Koloboke are very close to each other, though FastUtil is consistently faster. GS and HPPC form the next group, where HPPC is slightly faster than GS (a surprise, given HPPC's extra underlying array). Trove is noticeably slower.

“Put” test results

int-Object 'put' test results

FastUtil, Koloboke and GS are the leaders in the “put” test (and FastUtil is a clear winner here). HPPC becomes slower than Trove on the large map sizes.

“Remove” test results

int-Object 'remove' test results

This picture is very similar to the previous one: nearly identical results for FastUtil and Koloboke; GS is slightly slower; HPPC is pretty good on the smaller map sizes, but gets slower on the large maps; Trove is slower than HPPC, but parallel to it.

int-Object summary

The extra byte/boolean array used by HPPC/Trove makes them predictably slower than the 3 other implementations in this test. Once the underlying storage becomes identical, second-order optimizations start to make the difference. As a result, FastUtil pulls ahead of the other 2-array maps, and HPPC gets close to the 2-array maps on the smaller map sizes, where the CPU cache can fit the whole map (or most of it), so the extra array does not make a serious difference.

Object-int

tests.maptests.object_prim.FastUtilObjectIntMapTest  Object[] key, int[] value
tests.maptests.object_prim.GsObjectIntMapTest        Object[] keys, int[] values
tests.maptests.object_prim.HftcObjectIntMapTest      Object[] keys, int[] values
tests.maptests.object_prim.HppcObjectIntMapTest      Object[] keys, int[] values, boolean[] allocated
tests.maptests.object_prim.TroveObjectIntMapTest     Object[] _set, int[] _values

Only HPPC still uses 3 arrays for Object-int mapping. Hopefully, this will be fixed in the next HPPC release.

“Get” test results

Object-int 'get' results

FastUtil is the leader in this test. Koloboke and GS are very close behind, though GS runs a little slower once a map no longer fits into the CPU cache. HPPC is surprisingly faster than Trove…

“Put” test results

Object-int 'put' results

FastUtil, Koloboke and GS are very close to each other on map sizes up to 1M, but you can see the difference after this point: FastUtil is the fastest, Koloboke is second and GS is third. Trove is a little slower than those 3 implementations, but much faster than HPPC, which behaves really badly on the very large map sizes.

“Remove” test results

Object-int 'remove' results

This test is very similar to the previous one, with one difference: the gap between Trove and the 3 fastest implementations is much bigger.

Object-int summary

This test again highlights the importance of having the minimal possible number of underlying arrays in a map implementation: once you have an extra one, your implementation becomes non-competitive. The next important lesson is to use underlying arrays with a size equal to a power of 2. This allows you to use bit operations to calculate a lookup position in a map, instead of an expensive mod (used by Trove).

Object-Object

tests.maptests.object.FastUtilObjMapTest  Object[] key, Object[] value
tests.maptests.object.GsObjMapTest        Object[] table – interleaved keys and values
tests.maptests.object.HftcMutableObjTest  Object[] table – interleaved keys and values
tests.maptests.object.HppcObjMapTest      Object[] keys, Object[] values, boolean[] allocated
tests.maptests.object.JdkMapTest          Node<K,V>[] table – each Node could be a part of a linked list or a TreeMap (Java 8)
tests.maptests.object.TroveObjMapTest     Object[] _set, Object[] _values

This group of tests is bigger than the previous ones. We will see whether Koloboke can make use of the fact (you can specify it in the factory) that there will be no null keys. We will also try to work around the JDK HashMap “feature” that there may be one rehashing before you have added the requested number of entries to the map (the JDK implementation may not allocate arrays large enough to store the requested number of entries).

“Get” test results

Object-Object 'get' results

These test results are pretty surprising:

  • Both JDK versions are the fastest (rehashing does not make a difference in this test because it happens at setup).
  • FastUtil and GS are close to JDK, but slightly slower. Nevertheless, they have a smaller memory overhead, so they may still be considered as an option.
  • Koloboke is close to the above-mentioned implementations on some map sizes, but slower on others. This is most likely due to the variable fill factor (Koloboke does not perform well if you try to set a fixed fill factor).
  • Surprisingly, Trove is slower than HPPC in this test, despite the extra “allocated” array in the HPPC implementation.

“Put” test results

Object-Object 'put' results

  • The Koloboke implementation is the fastest in this test due to the best possible storage structure.
  • FastUtil keeps up with Koloboke on the smaller map sizes (while the map fits in the CPU cache), but becomes a little slower on large maps due to a less efficient memory access pattern.
  • GS, on the other hand, is slower than Koloboke and FastUtil on the smaller maps (due to less efficient code), but gets closer to Koloboke once a map no longer fits in the CPU cache.
  • As for the JDK maps, you can see that a map with a correctly preallocated capacity is always faster than a default map (by "default" I mean a map where you specify just the requested capacity in the constructor, say 10.000, instead of an inflated capacity of 10.000/fill_factor(0.5)=20.000). The difference is bigger on some sizes and smaller on others, due to one or two factors in effect: 1) you always have a smaller chance of hash collisions in the bigger capacity map; 2) you may avoid rehashing by specifying a bigger capacity for the JDK map.
  • Trove is faster than HPPC in this test due to its 2-array underlying implementation.

“Remove” test results

Object-Object 'remove' results

  • Koloboke is the fastest in this test, and GS is very close to it. FastUtil and a properly sized JDK map are very close to those two, but slightly slower.
  • The default capacity JDK map is noticeably slower than the first 4 implementations.
  • It is followed by Trove (not far behind) and HPPC (very far behind).

Object-Object summary

The first lesson you should remember from this group of tests is that a JDK HashMap may not allocate enough storage for the requested number of elements. As a result, you may be penalised by rehashing. Nevertheless, a JDK HashMap is extremely good if you mostly use it as read-only storage.

The second lesson is that the underlying storage is still the most important factor affecting hash map performance – an implementation should try to minimize the number of memory accesses per operation and should not expect the CPU cache to help it.

Identity maps

The most important property of identity maps is that they expect the same object (rather than an equal one) to be used for all accesses to the map. This means an identity map uses == instead of yourObject.equals() and System.identityHashCode() instead of yourObject.hashCode().

Keep in mind that some non-identity maps also make an == check prior to the requested_key.equals(stored_map_key) check (for example, the JDK and Koloboke implementations do this) in the hope that some previously inserted keys will later be used for queries. Pay attention to whether this matches your application's usage.
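
A quick sketch of the difference, using JDK classes (new Integer(...) is used deliberately to obtain two equal but distinct objects; 1000 is outside the Integer autoboxing cache):

import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

public class IdentityVsEquality {
    public static void main( String[] args ) {
        final Map<Integer, String> equalityMap = new HashMap<>();
        final Map<Integer, String> identityMap = new IdentityHashMap<>();

        final Integer key1 = new Integer( 1000 );
        final Integer key2 = new Integer( 1000 ); //equal to key1, but a different object

        equalityMap.put( key1, "value" );
        identityMap.put( key1, "value" );

        System.out.println( equalityMap.get( key2 ) ); //"value" - equals()/hashCode() based lookup
        System.out.println( identityMap.get( key2 ) ); //null - == / System.identityHashCode() based lookup
    }
}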

tests.maptests.identity_object.FastUtilRef2ObjectMapTest  Object[] key, Object[] value
tests.maptests.identity_object.GsIdentityMapTest          Object[] table – interleaved keys and values
tests.maptests.identity_object.HftcIdentityMapTest        Object[] table – interleaved keys and values
tests.maptests.identity_object.HppcIdentityMapTest        Object[] keys, Object[] values, boolean[] allocated
tests.maptests.identity_object.JDKIdentityMapTest         Object[] table – interleaved keys and values
tests.maptests.identity_object.TroveIdentityMapTest       Object[] _set, Object[] _values

There are 3 groups here: JDK, Koloboke and GS use a single interleaved array, FastUtil and Trove use 2 arrays, finally HPPC uses 3 arrays.

“Get” test results

Object (identity)-Object 'get' results

“Put” test results

Object (identity)-Object 'put' results

“Remove” test results

Object (identity)-Object 'remove' results

All these tests are more difficult to comment on. We can see that Trove is simply slow and HPPC is penalised for its third underlying array. FastUtil, GS and JDK are consistently good. Koloboke is also good, but is surprisingly slower than most implementations on the small map sizes in the “get” tests.

Summary

  • FastUtil 6.6.0 turned out to be consistently fast. It may become even faster if it introduces storage structures other than 2 separate arrays for keys and values.
  • Koloboke comes second in many tests, but it still outperforms FastUtil in the int-int tests.
  • The GS implementation is good enough, but is slower than FastUtil and Koloboke.
  • JDK maps are pretty good for Object-Object maps, provided that you can tolerate the extra memory consumption and you call the HashMap constructor with the required capacity = actual_capacity / fill_factor + 1 to avoid rehashing (see the sketch after this list).
  • Trove suffers from using the mod operation for array index calculations, and HPPC is too slow due to an extra underlying array (for cell states).
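
A sketch of the JDK sizing rule from the list above (0.5f here matches the fill factor used in these tests; the JDK default is 0.75f):

// Preallocate a JDK HashMap so that inserting expectedSize entries never triggers rehashing.
final int expectedSize = 10_000;
final float fillFactor = 0.5f;
final Map<Integer, Integer> map =
    new HashMap<>( (int) ( expectedSize / fillFactor ) + 1, fillFactor );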

Source code

The article source code is now hosted at GitHub: https://github.com/mikvor/hashmapTest.

Please note that you should run this project via the tests.MapTestRunner class:

mvn clean install
java -cp target/benchmarks.jar tests.MapTestRunner

The full test set may take around 24 hours to complete. You need a computer with proper CPU cooling, so that it can sustain hours of CPU load without throttling (small laptops are seldom designed for such a load). You also need a 20G+ heap to run the 100M tests, so it makes sense to shrink MapTestRunner.TOTAL_SIZE to 10M if you want to use a commodity computer for testing.

Test results

Here are the article test results in the form of OpenOffice spreadsheet files.

Going over Xmx32G heap boundary means you will have less memory available
http://java-performance.info/over-32g-heap-java/
Wed, 31 Dec 2014 01:57:37 +0000

by Mikhail Vorontsov

This small article will remind you what happens to the Oracle JVM once your heap setting goes over 32G. By default, all references in the JVM occupy 4 bytes on heaps under 32G. This decision is made by the JVM at start-up. You can force 8 byte references on small heaps with the -XX:-UseCompressedOops JVM option (though it does not make any sense for production systems!).

Once your heap exceeds 32G, you are in 64 bit land, so your object references will now use 8 bytes instead of 4. As Scott Oaks mentioned in his Java Performance: The Definitive Guide book (pages 234-236, read my review of this book here), an average Java program uses about 20% of its heap for object references. It means that by setting anything between Xmx32G and roughly Xmx37G–Xmx38G you will actually reduce the amount of heap available to your application (the actual numbers, of course, depend on your application). This may come as a big surprise to anyone thinking that adding extra memory will let their application process more data :)


Test – populating a LinkedList<Integer>

I decided to test the worst case scenario – populating a LinkedList<Integer> with increasing consecutive values. It is an interesting exercise: calculate how much heap you would need to insert 2 billion Integer values into a LinkedList. I’ll leave it to you :)

The test code is extremely simple:

import java.util.LinkedList;
import java.util.List;

public class Mem32Test {
    public static void main(String[] args) {
        List<Integer> lst = new LinkedList<>();
        int i = 0;
        while ( true )
        {
            lst.add( new Integer( i++ ) );
            if ( ( i & 0xFFFF ) == 0 )
                System.out.println( i ); //shows where you are :)
            if ( i == System.currentTimeMillis() )
                break; //otherwise will not compile
        }
        System.out.println( lst.size() ); //needed to avoid dead code optimizations
    }
}

You should run this code with the -Xmx and -verbose:gc (or -XX:+PrintGCDetails) options. You need to see the garbage collection logs to understand when you are running out of memory (it may take pretty long before you get an actual OOM).

First of all, I found the exact spot where the JVM switches to 64 bit references – it is Xmx32767M (surprisingly, 1Mb less than 32G). After that I noticed that the amount of memory actually available to the application does not increase linearly with the heap. Instead it seems to grow in steps (see what happens between Xmx49200M and Xmx49500M) – this is something I want to investigate further.

Test results

Number of elements in the LinkedList   Heap size
666,697,728                            Xmx32700M
667,287,552                            Xmx32730M
667,680,768                            Xmx32750M
667,877,376                            Xmx32760M
668,008,448                            Xmx32764M
668,139,520                            Xmx32765M
668,008,448                            Xmx32766M
422,510,592                            Xmx32767M
429,391,872                            Xmx33700M
535,166,976                            Xmx42000M
639,041,536                            Xmx48700M
643,039,232                            Xmx49200M
731,578,368                            Xmx49500M
734,658,560                            Xmx49700M
1,442,119,680                          Xmx110000M

As you can see, the number of list elements drops dramatically from 668 million to 422 million at Xmx32767M due to the switch to 64 bit references.

Let’s see why we have such a large drop in the number of elements we can fit into a LinkedList. A JDK LinkedList is a doubly linked list: each Node contains prev and next references as well as a data item (an object reference as well).

In 32 bit mode, each Java object consists of a 12 byte header followed by the object fields, and each object's memory consumption is aligned to 8 bytes. So, a Node occupies 12 + 4 * 3 = 24 bytes in 32 bit mode, and an Integer needs 12 + 4 = 16 bytes (no alignment padding is required in either case).

Once you enter 64 bit territory, an object header occupies 16 bytes instead of 12, and each object reference uses 8 bytes instead of 4. And do not forget about the 8 byte alignment. As a result, a Node occupies 16 + 3 * 8 = 40 bytes in 64 bit mode, and an Integer occupies 16 + 4 = 20 bytes, which is aligned to 24 bytes.

As a result, each LinkedList element grows from 40 to 64 bytes once you switch from 32 to 64 bit references.
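
If you want to verify these numbers on your own JVM, the OpenJDK JOL (Java Object Layout) tool can print the actual header sizes, field offsets and alignment padding (a sketch; it assumes the org.openjdk.jol:jol-core dependency is on the classpath):

import org.openjdk.jol.info.ClassLayout;

public class LayoutCheck {
    public static void main( String[] args ) {
        //prints the exact layout of Integer, including the header and any padding
        System.out.println( ClassLayout.parseClass( Integer.class ).toPrintable() );
    }
}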

Some memory tuning hints

As I have mentioned above, using the JVM with an over-32G heap carries a rather large performance penalty. Besides the increased application footprint, the JVM garbage collector will also have to deal with all these objects (add the -XX:+PrintGCDetails option to your JVM to see the impact of garbage collection on your application).

I have already written about a few simple tricks you can apply to a non-optimized application in order to reduce its memory footprint (in some cases the memory gains can be quite large, so do not ignore this advice!):

  • Your application may contain a large number of strings with equal contents. If you are on Java 7 or newer, you should consider string interning – it is the ultimate tool for getting rid of duplicate strings, but it should be used with caution – intern only medium to long living strings which are likely to be duplicated. If you use Java 8 update 20 or newer, try string deduplication – the JVM will take care of string duplicates on its own (you must use the G1 garbage collector for this feature! – see the example after this list).
  • If you have a lot of numeric wrappers in the heap, like Integer or Double, you are likely keeping them inside collections. There is no excuse in 2015 for avoiding primitive collections! I have recently written a large overview of hash maps in various primitive collection libraries. You may also take a look at my older Trove article.
  • Finally, look through the series of general memory consumption / memory saving articles I have written some time ago (part 1, part 2, part 3, part 4).
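
For example, the string deduplication mentioned in the first point is enabled with the following flags (shown on a hypothetical yourapp.jar; the deduplication flags require Java 8u20+ and the G1 collector):

java -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics -jar yourapp.jar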

Summary

  • Be careful when you increase your application heap size from under 32G to over 32G – the JVM switches to 64 bit object references at that moment, which means that your application may end up with less available heap space. A rule of thumb is to jump from 32G right to 37-38G and continue adding memory from that point. The actual extent of the "grey" territory depends on your application: the bigger an average Java object in your application is, the smaller the overhead.
  • It may be wise to reduce your application memory footprint below 32G instead of dealing with a bigger heap. Look at my articles for some ideas: string interning, string deduplication, hash maps and other primitive collections, Trove.

Performance of various general compression algorithms – some of them are unbelievably fast!
http://java-performance.info/performance-general-compression/
Sat, 20 Dec 2014 06:18:20 +0000

by Mikhail Vorontsov

07 Jan 2015 update: extended the LZ4 description (thanks to Mikael Grev for the hint!)

This article will give you an overview of the performance of several general compression algorithm implementations. As it turns out, some of them can be used even when your CPU requirements are pretty strict.

In this article we will compare:

  • JDK GZIP – a slow algorithm with good compression, which could be used for long term data compression. Implemented in the JDK as java.util.zip.GZIPInputStream / GZIPOutputStream.
  • JDK deflate – another algorithm available in the JDK (it is used for zip files). Unlike GZIP, you can set a compression level for this algorithm, which allows you to trade compression time for output file size. Available levels are 0 (store, no compression) and 1 (fastest) to 9 (slowest, best compression). Implemented as java.util.zip.DeflaterOutputStream / InflaterInputStream.
  • The Java implementation of the LZ4 compression algorithm – the fastest algorithm in this article, with a compression ratio a bit worse than the fastest deflate. I advise you to read the wikipedia article about this algorithm to understand its usage. It is distributed under a friendly Apache License 2.0.
  • Snappy – a popular compressor developed at Google, which aims to be fast with a relatively good compression ratio. I have tested this implementation. It is also distributed under the Apache License 2.0.


Compression test

I had to think a little about what file set would be useful for data compression testing and at the same time be present on most Java developers' machines (I don’t want to ask you to download hundreds of megabytes of files just to run the tests). Finally I realised that most of you have the JDK javadoc installed locally, so I decided to build a single file out of the javadoc directory by concatenating all its files. This can be easily done with tar, but not all of us are Linux users, so I have used the following class to generate such a file:

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;

public class InputGenerator {
    private static final String JAVADOC_PATH = "your_path_to_JDK/docs";
    public static final File FILE_PATH = new File( "your_output_file_path" );

    static
    {
        try {
            if ( !FILE_PATH.exists() )
                makeJavadocFile();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static void makeJavadocFile() throws IOException {
        try( OutputStream os = new BufferedOutputStream( new FileOutputStream( FILE_PATH ), 65536 ) )
        {
            appendDir(os, new File( JAVADOC_PATH ));
        }
        System.out.println( "Javadoc file created" );
    }

    private static void appendDir( final OutputStream os, final File root ) throws IOException {
        for ( File f : root.listFiles() )
        {
            if ( f.isDirectory() )
                appendDir( os, f );
            else
                Files.copy(f.toPath(), os);
        }
    }
}

The total file size on my machine is 354,509,602 bytes (338 Mb).

Testing

Initially I thought about reading the whole file into RAM and compressing it there. It turned out that you can pretty easily run out of heap space on commodity 4G machines with such an approach :(

Instead I decided to rely on the OS file cache. We will use JMH as the test framework. The file will be loaded into the OS cache during the warmup phase (we will run the compression test twice during warmup). We will compress into a ByteArrayOutputStream (I know, it is not the fastest solution, but it is consistent across all tests and it does not spend extra time writing compressed data to disk), so you still need some RAM to keep the output in memory.

Here is the test base class. All tests differ only in the compressing output stream implementation, so they create a stream in a StreamFactory implementation and reuse the base class test:

import org.openjdk.jmh.annotations.*;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.TimeUnit;

@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@Fork(1)
@Warmup(iterations = 2)
@Measurement(iterations = 3)
@BenchmarkMode(Mode.SingleShotTime)
public class TestParent {
    protected Path m_inputFile;

    @Setup
    public void setup()
    {
        m_inputFile = InputGenerator.FILE_PATH.toPath();
    }

    interface StreamFactory
    {
        public OutputStream getStream( final OutputStream underlyingStream ) throws IOException;
    }

    public int baseBenchmark( final StreamFactory factory ) throws IOException
    {
        ByteArrayOutputStream bos = new ByteArrayOutputStream((int) m_inputFile.toFile().length());
        try ( OutputStream os = factory.getStream( bos ) )
        {
            Files.copy(m_inputFile, os);
        }
        return bos.size();
    }
}

All tests look similar (you can find them in the source code at the end of this article), but here is an example – the JDK deflate test:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;

import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class JdkDeflateTest extends TestParent {
    @Param({"1", "2", "3", "4", "5", "6", "7", "8", "9"})
    public int m_lvl;

    @Benchmark
    public int deflate() throws IOException
    {
        return baseBenchmark(new StreamFactory() {
            @Override
            public OutputStream getStream(OutputStream underlyingStream) throws IOException {
                return new DeflaterOutputStream( underlyingStream, new Deflater( m_lvl, true ), 512 );
            }
        });
    }
}

Test results

Output file sizes

First of all, let’s look at the output file sizes:

Implementation          File size (bytes)
GZIP                    64,214,683
Snappy (normal)         138,250,196
Snappy (framed)         101,470,113
LZ4 (fast 64K)          98,326,531
LZ4 (fast 128K)         94,403,752
LZ4 (fast double 64K)   94,478,009
LZ4 (fast 32M)          89,758,917
LZ4 (fast double 32M)   84,337,838
LZ4 (fast triple 32M)   83,426,446
LZ4 (high)              82,085,338
Deflate (lvl=1)         78,383,316
Deflate (lvl=2)         75,280,213
Deflate (lvl=3)         73,251,533
Deflate (lvl=4)         68,110,895
Deflate (lvl=5)         65,721,750
Deflate (lvl=6)         64,214,665
Deflate (lvl=7)         64,019,601
Deflate (lvl=8)         63,874,787
Deflate (lvl=9)         63,868,222

Output size

As you can see, the difference between the smallest and the biggest compressed files is pretty large (from 61 to 131 Mb). There are several LZ4 options in this table – I will cover them in more detail closer to the end of this article. Let’s see how long each implementation took to compress.

Compression time

Implementation               Compression time (ms)
Snappy.framedOutput          2264.700
Snappy.normalOutput          2201.120
Lz4.testFastNative64K        1075.138
Lz4.testFastNative128K       1068.932
Lz4.testFastNativeDouble64K  1261.138
Lz4.testFastNative32M        1076.141
Lz4.testFastNativeDouble32M  1230.563
Lz4.testFastNativeTriple32M  1433.068
Lz4.testHighNative64K        6812.911
deflate (lvl=1)              4522.644
deflate (lvl=2)              4726.477
deflate (lvl=3)              5081.934
deflate (lvl=4)              6739.450
deflate (lvl=5)              7896.572
deflate (lvl=6)              9783.701
deflate (lvl=7)              10731.761
deflate (lvl=8)              14760.361
deflate (lvl=9)              14878.364
GZIP                         10351.887

Compression time

Let’s merge the compression times and file sizes into one diagram in order to calculate the throughput and draw some conclusions.

Throughput and efficiency

Implementation               Time (ms)   Uncompressed file size (Mb)   Throughput (Mb/sec)   Compressed file size (Mb)
Snappy.normalOutput          2201.12     338                           153.5581885586        131.8456611633
Snappy.framedOutput          2264.7      338                           149.2471409017        96.7694406509
Lz4.testFastNative64K        1075.138    338                           314.3782472576        93.771487236
Lz4.testFastNative128K       1068.932    338                           316.2034628957        90.0304336548
Lz4.testFastNativeDouble64K  1261.138    338                           268.0119067065        90.1012506485
Lz4.testFastNative32M        1076.141    338                           314.0852360425        85.6007738113
Lz4.testFastNativeDouble32M  1230.563    338                           274.6710245636        80.4308300018
Lz4.testFastNativeTriple32M  1433.068    338                           235.8576145724        79.5616588593
Lz4.testHighNative64K        6812.9      338                           49.6117659147         78.2826786041
deflate (lvl=1)              4522.644    338                           74.7350443679         74.752155304
deflate (lvl=2)              4726.477    338                           71.5120374012         71.7928056717
deflate (lvl=3)              5081.934    338                           66.5101120951         69.8581056595
deflate (lvl=4)              6739.45     338                           50.1524605124         64.9556112289
deflate (lvl=5)              7896.572    338                           42.8033835442         62.6771450043
deflate (lvl=6)              9783.701    338                           34.5472536415         61.2398767471
deflate (lvl=7)              10731.761   338                           31.4952969974         61.0538492203
deflate (lvl=8)              14760.361   338                           22.8991689295         60.9157438278
deflate (lvl=9)              14878.364   338                           22.7175514727         60.9094829559
GZIP                         10351.887   338                           32.651051929          61.2398939133

Algorithm performance

Many of these implementations are pretty slow: ~23 Mb/sec for high level deflate or even ~33 Mb/sec for GZIP is not something you should be happy with on a Xeon E5-2650. At the same time, the fastest deflate version runs at ~75 Mb/sec, Snappy at ~150 Mb/sec and LZ4 (fast, JNI) at a truly surprising ~320 Mb/sec (actually even faster, because this time includes reading the file from the OS cache).

This diagram clearly shows that 2 implementations are not competitive at the moment: Snappy is slower than LZ4 (fast) and produces bigger files. LZ4 (high) is in turn slower than deflate levels 1 to 4 and produces a bigger output than even deflate level=1.

As a result, I would probably choose between the LZ4 (fast) JNI implementation and deflate level=1 when I need on-the-fly compression. You may have to use deflate if your organization does not allow 3rd party libraries. You should consider how many spare CPU cycles you have on your box, as well as where the compressed data is being sent. For example, if you are writing compressed data directly to an HDD, then performance above ~100 Mb/sec will not help you (provided that your file is large enough) – the HDD speed will become the bottleneck. With the same output written to a modern SSD, even LZ4 would not be fast enough :) If you compress your data prior to sending it over a gigabit network, you should probably use LZ4, because 75 Mb/sec of deflate performance is considerably less than the 125 Mb/sec of network throughput (yes, I know about packet headers, but the difference would still be considerable).

LZ4 compression algorithm

LZ4 is an algorithm which encodes data in frames. Each frame contains a header and compressed data. The size of the compression buffer (amount of data which will be compressed in one frame) is an LZ4BlockOutputStream constructor argument:

public LZ4BlockOutputStream(OutputStream out, int blockSize, LZ4Compressor compressor)

The current implementation allows the block size to be between 64 bytes and 32 Mb. Obviously, the bigger the frame, the higher the compression ratio. You should keep in mind that an identically sized decompression buffer will be allocated as well (this information is stored in the frame header).

As you have seen above, there is very little difference in the time required to compress the data with either 64K or 32M buffer, which means you should try using the bigger buffer in order to obtain some extra compression.

Another interesting LZ4 property (thanks to Mikael Grev for the idea) is that it makes sense to use 2 LZ4BlockOutputStream-s in a row, because subsequent blocks may contain similarly encoded data. The performance penalty is pretty much unnoticeable, but you gain extra compression (in the case of a 32M buffer, the output was 89M for a single pass and 84M for a double pass, at a tiny cost of ~200 ms for the 89M of data produced by the first pass). It does not make much sense to make three or more passes – you will get very little additional compression.
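
A sketch of such a double-pass setup, assuming the net.jpountz lz4-java API used in these tests (underlyingStream is a placeholder for your destination stream; the factory/class names may differ in your library version):

import net.jpountz.lz4.LZ4BlockOutputStream;
import net.jpountz.lz4.LZ4Factory;
import java.io.OutputStream;

// The outer stream compresses your data; its output is compressed again
// by the inner stream before reaching the underlying stream.
final LZ4Factory factory = LZ4Factory.fastestInstance();
final int blockSize = 32 * 1024 * 1024; //32M, the current maximum
final OutputStream os = new LZ4BlockOutputStream(
        new LZ4BlockOutputStream( underlyingStream, blockSize, factory.fastCompressor() ),
        blockSize, factory.fastCompressor() );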

At the same time, it makes more sense to double the buffer size for a single pass than to make 2 passes with smaller buffers (the exception being buffers over 16M, where chaining lets you circumvent the 32M compression buffer limitation) – as you can see, you get a nearly identical file size for a double 64K pass and a single 128K pass. A double pass, as you have seen, will always be slower.

See also

Benchmark suite for data compression libraries on the JVM – a comprehensive test suite for data compressors implemented in Java or accessible via JNI. It tests time and space efficiency using several data sets. Thanks to Sam Van Oort for the link!

Summary

  • If you think that data compression is painfully slow, check the LZ4 (fast) implementation, which is able to compress a text file at ~320 Mb/sec – compression at such speed should be unnoticeable for most applications. It makes sense to increase the LZ4 compression buffer size up to its 32M limit if possible (keep in mind that you will need a similarly sized buffer for decompression). You can also try chaining 2 LZ4BlockOutputStream-s with a 32M buffer size to get the most out of LZ4.
  • If you are restricted from using 3rd party libraries or want slightly better compression, check the JDK deflate (lvl=1) codec – it was able to compress the same file at ~75 Mb/sec.

Source code

Java compression test project source code

Use the standard JMH approach to run this project:

mvn clean install
java -jar target/benchmarks.jar

Large HashMap overview: JDK, FastUtil, Goldman Sachs, HPPC, Koloboke, Trove
http://java-performance.info/large-hashmap-overview-jdk-fastutil-goldman-sachs-hppc-koloboke-trove/
Sat, 01 Nov 2014 12:41:26 +0000

by Mikhail Vorontsov


This article is outdated! A newer version covering the latest versions of the collections libraries is available here.

04 Jan 2015 update: a couple of clarifications; fixed a bug in the FastUtil Object-int test – now it is much faster (thanks to Sebastiano Vigna for his suggestions).

Introduction

This article will give you an overview of hash map implementations in 5 well-known libraries, with the JDK HashMap as a baseline. We will test separately:

  • Primitive to primitive maps
  • Primitive to object maps
  • Object to primitive maps
  • Object to Object maps (JDK participates only in this section)

This article will overview a single test – map read access for a random set of keys (the set of keys is shared by all collections of a given capacity).

We will also pay attention to the way the data is stored inside these collections and to some pretty interesting implementation details.

Participants

JDK 8

JDK HashMap is the oldest hash map implementation in this test. It got a couple of major updates recently – a shared underlying storage for empty maps in Java 7u40 and the ability to convert the underlying hash bucket linked lists into tree maps (for better worst-case performance) in Java 8.

FastUtil 6.5.15

FastUtil provides all 4 options listed above (all combinations of primitives and objects). Besides that, there are several other map types available for each parameter type combination: array map, AVL tree map and RB tree map. Nevertheless, we are only interested in hash maps in this article.

Goldman Sachs Collections 5.1.0

Goldman Sachs open-sourced its collections library about 3 years ago. In my opinion, this library provides the widest range of collections out of the box (if you need them). You should definitely pay attention to it if you need more than a hash map, a tree map and a list for your work :) For the purposes of this article, GS collections provides normal, synchronized and unmodifiable versions of each hash map. The last 2 are just facades over the normal map, so they don't provide any performance advantages.

HPPC 0.6.1

HPPC provides array lists, array deques, hash sets and hash maps for all primitive types. HPPC provides normal hash maps for primitive keys and both normal and identity hash maps for object keys.

Koloboke 0.6

Koloboke is the youngest of all libraries in this article. It is developed as a part of the OpenHFT project by Roman Leventov. This library currently provides hash maps and hash sets for all primitive/object combinations. The library was recently renamed from HFTC, so some artifacts in my tests still use the old library name.

Trove 3.0.3

Trove has been available for a long time and is quite stable. Unfortunately, not much development is happening in this project at the moment. Trove provides list, stack, queue, hash set and hash map implementations for all primitive/object combinations. I have already written about Trove.

Data storage implementations and tests

This article will look at 4 different sorts of maps:

  1. int → int
  2. int → Integer
  3. Integer → int
  4. Integer → Integer

Let’s see how the data is stored in each kind of those maps. We will refer to the test names instead of the actual implementation names, because a lot of those implementations are called very similarly and it’s not easy to distinguish them by name. After looking at the implementation details, we will check how they affect the actual test results.

We will use JMH 1.0 for testing. Here is the test description: for each map size in (10K, 100K, 1M, 10M, 100M) (outer loop) generate a set of random keys (reused by every test at a given map size) and then run a test for each map implementation (inner loop). Each test is run 100M / map_size times, so that map.get is called 100M times in each test case.

  1. In setup: Take a set of int keys and required fill factor
  2. Initialize a map with a given fill factor and capacity = number of keys
  3. Populate a map with keys and values = keys
  4. Store a reference to the keys array or convert it into Integer[] for tests with object keys (nevertheless, use the same keys)
  5. All tests are nearly identical – get the stored values for an array of keys and use these values, so that the JVM will not optimize your code away:

    public int runRandomTest() {
        int res = 0;
        for ( int i = 0; i < m_keys.length; ++i )
            res = res ^ m_map.get( m_keys[ i ] );
        return res;
    }

int-int

tests.maptests.primitive.FastUtilMapTest int[] keys, int[] values, boolean[] used
tests.maptests.primitive.GsMutableMapTest int[] keys, int[] values
tests.maptests.primitive.HftcMutableMapTest long[] (key-low bits, value-high bits)
tests.maptests.primitive.HppcMapTest int[] keys, int[] values, boolean[] allocated
tests.maptests.primitive.TroveMapTest int[] _set, int[] _values, byte[] _states

As you can see, FastUtil, HPPC and Trove use identical storage, so you may expect similar performance from them.

Handling of empty and removed cells in GS collections and Koloboke

GS collections use just the keys and values arrays. If you have ever looked at hash map implementations, you know that a map must at least distinguish empty cells from occupied ones (some maps also use a "removed cell" marker). How can you achieve such functionality without extra storage? GS IntIntHashMap uses a companion sentinel object containing the values for key=0 (empty cell marker) and key=1 (removed cell marker). All operations on keys 0 and 1 are done on the sentinel object. Such an object allows GS IntIntHashMap to use O(1) storage for flags instead of O(capacity). It also means only 2 memory accesses per lookup instead of 3, which makes this implementation faster.
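
To make the idea concrete, here is a conceptual sketch of sentinel-based empty cell handling – it is not the actual GS implementation; the class name, the hash function and the removed-cell handling (omitted here) are all made up for illustration:

// Conceptual sketch: key 0 marks an empty cell in the arrays, so the real
// mapping for key 0 is kept in sentinel fields instead of the arrays.
class SentinelIntIntMap {
    private static final int EMPTY_KEY = 0;
    private final int[] keys;     // keys[i] == 0 means "cell i is empty"
    private final int[] values;
    private boolean hasZeroKey;   // sentinel state for key 0
    private int zeroKeyValue;     // value mapped to key 0, if any

    SentinelIntIntMap( final int capacity ) {
        keys = new int[ capacity ];
        values = new int[ capacity ];
    }

    int get( final int key, final int missing ) {
        if ( key == EMPTY_KEY )   // handled by the sentinel, not by the arrays
            return hasZeroKey ? zeroKeyValue : missing;
        int idx = ( ( key * 0x9E3779B9 ) >>> 16 ) % keys.length;  // any int hash will do
        while ( keys[ idx ] != EMPTY_KEY ) {  // linear probing, no per-cell state array
            if ( keys[ idx ] == key )
                return values[ idx ];
            idx = ( idx + 1 ) % keys.length;
        }
        return missing;
    }
}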

The Koloboke int-int map (the actual class name is hidden behind factories and may change) goes even further. First of all, in some cases it uses an array of a wider data type as storage, capable of keeping both a key and a value in one element. The int-int map is an example of such an approach: a key is stored in the low 32 bits of a long cell and a value is stored in the high 32 bits. Such a layout means only one cache line miss on cold data access instead of 2 (GS collections) or 3 (all others).
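
A hedged sketch of the packing trick – the bit layout follows the description above, the helper class itself is made up:

// Pack an int key (low 32 bits) and an int value (high 32 bits) into one long cell,
// so a single cache line fetch brings in both of them.
final class IntIntCell {
    static long pack( final int key, final int value ) {
        return ( (long) value << 32 ) | ( key & 0xFFFFFFFFL );
    }

    static int key( final long cell ) {
        return (int) cell;             // low 32 bits
    }

    static int value( final long cell ) {
        return (int) ( cell >>> 32 );  // high 32 bits
    }
}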

Koloboke uses a different technique for marking unused cells. When a map is initialized, it picks a random int and uses it as the free cell marker. If you try to insert a key equal to the free cell marker, it picks another random value which is not present in the map, and so on. It means that Koloboke spends just 4 bytes on handling empty cells and does it in an extremely efficient way.

In general, such an approach does not impose any performance penalty unless your map size gets close to the number of values in the given data type. What happens with smaller key data types? You will get a HashOverflowException (defined in the koloboke-api library) if you attempt to add all possible key values into a map. You can use the following test to reproduce it:

HashByteIntMap m = HashByteIntMaps.newMutableMap( 256 );
for ( int i = Byte.MIN_VALUE; i < Byte.MAX_VALUE; ++i )
{
    final byte key = (byte) i;
    m.put( key, i );
}
m.put( Byte.MAX_VALUE, 127 );   //exception will be thrown here
System.out.println( m.size() );

Nevertheless, this should not be an issue in real life. If you want to map every (or nearly every) byte/char/short key to some value, you are better off using an array of the value type indexed by the keys, as sketched below.
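
A minimal sketch of that alternative for byte keys – the only trick is offsetting by Byte.MIN_VALUE to map the key range [-128, 127] onto array indices [0, 255]; the stored values are arbitrary:

// For a (nearly) full byte -> int mapping a plain array beats any hash map:
final int[] byteToValue = new int[ 256 ];
for ( int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; ++i )
    byteToValue[ i - Byte.MIN_VALUE ] = i * 2;  // store a value for every key

final byte key = 42;
final int value = byteToValue[ key - Byte.MIN_VALUE ];  // O(1) lookup, no hashing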

int-int Test results

Each test section will start with a results table followed by a chart. The first line of each table contains the map sizes. All test results are in milliseconds.

Map size 10000 100000 1000000 10000000 100000000
tests.maptests.primitive.HftcMutableMapTest 955 1324 1871 4198 3805
tests.maptests.primitive.HftcImmutableMapTest 941 1335 1807 4194 3793
tests.maptests.primitive.HftcUpdateableMapTest 949 1314 1836 4183 3799
tests.maptests.primitive.GsMutableMapTest 977 1883 3322 6256 7754
tests.maptests.primitive.GsImmutableMapTest 997 1895 3279 6201 7786
tests.maptests.primitive.FastUtilMapTest 1045 1590 3776 7655 10095
tests.maptests.primitive.HppcMapTest 1021 1580 3693 7612 10086
tests.maptests.primitive.TroveMapTest 1775 2642 5137 10799 13834

(Chart: int-int test results)

As you can see, the libraries split into 4 distinctly different groups (fastest to slowest):

  1. Koloboke shows the best results: a single long[] for storage and the clever trick with a random free cell marker pay off. All 3 versions of Koloboke collections show exactly the same result in this test (it does not mean they will be equally fast in other tests as well).
  2. The GS collections implementation is the second fastest – using 2 arrays instead of 3, as well as good code quality, pays off here.
  3. FastUtil and HPPC show virtually the same performance (less than 2% difference).
  4. Trove is the slowest implementation in this test, being about 2 times slower than Koloboke on most map sizes and falling even further behind on huge maps (10M+).

Note that Koloboke works faster on the 100M map than on the 10M map. According to an email from Roman Leventov, this happens because a bigger fill factor is chosen for a map of size 10M than for a map of size 100M. You will see a similar difference in the Object-Object test results.

int-Object

tests.maptests.prim_object.FastUtilIntObjectMapTest int[] key, Object[] value, boolean[] used
tests.maptests.prim_object.GsIntObjectMapTest int[] keys, Object[] values
tests.maptests.prim_object.HftcIntObjectMapTest int[] keys, Object[] values
tests.maptests.prim_object.HppcIntObjectMapTest int[] keys, Object[] values, boolean[] allocated
tests.maptests.prim_object.TroveIntObjectMapTest int[] _set, Object[] _values, byte[] _states

No surprises here: FastUtil, HPPC and Trove use 3 arrays (including an array of cell states). GS collections and Koloboke use 2 arrays plus tricks similar to those listed above for the special cases.

int-Object test results

Map size 10000 100000 1000000 10000000 100000000
tests.maptests.prim_object.HftcIntObjectMapTest 1223 1358 3034 6187 7064
tests.maptests.prim_object.FastUtilIntObjectMapTest 1213 1746 4112 7902 10595
tests.maptests.prim_object.GsIntObjectMapTest 1764 2658 4310 7775 9715
tests.maptests.prim_object.HppcIntObjectMapTest 1666 1725 4083 8447 12202
tests.maptests.prim_object.TroveIntObjectMapTest 1987 2835 5812 11269 14265

(Chart: int-Object test results)

There are 3 groups in this test (fastest to slowest):

  1. Koloboke is the fastest due to using only 2 arrays and simpler code in the empty cell case.
  2. It is followed by GS collections (which failed to turn its 2 storage arrays instead of 3 into an advantage), FastUtil and HPPC. Their results vary slightly between tests, but stay relatively close to each other.
  3. Trove is the slowest again, losing 1.5 to 2 times to Koloboke.

Object-int

tests.maptests.object_prim.FastUtilObjectIntMapTest Object[] key, int[] value, boolean[] used
tests.maptests.object_prim.GsObjectIntMapTest Object[] keys, int[] values
tests.maptests.object_prim.HftcObjectIntMapTest Object[] keys, int[] values
tests.maptests.object_prim.HppcObjectIntMapTest Object[] keys, int[] values, boolean[] allocated
tests.maptests.object_prim.TroveObjectIntMapTest Object[] _set, int[] _values

FastUtil and HPPC use a third array in the case of Object keys. This seems to be a bad idea, because with Object keys you can always use a private sentinel object as the empty cell flag. We will see the actual performance a bit below.

GS collections, Koloboke and Trove use 2 arrays, so we should expect them to be a little faster.

Object-int test results

Map size 10000 100000 1000000 10000000 100000000
tests.maptests.object_prim.HftcObjectIntMapTest 1775 1781 4320 8567 8962
tests.maptests.object_prim.GsObjectIntMapTest 1598 2876 6214 8467 11700
tests.maptests.object_prim.FastUtilObjectIntMapTest 1599 2614 6151 9273 15146
tests.maptests.object_prim.HppcObjectIntMapTest 2297 2687 6077 10788 17425
tests.maptests.object_prim.TroveObjectIntMapTest 2550 3286 5837 11804 14324

(Chart: Object-int test results)

There are 2 groups in this test, though the groups are not as distinct as before (fastest to slowest):

  1. Koloboke is faster than the other implementations, with the exception of the 10K map, where it is slower than both GS collections and FastUtil, and the 10M map, where it is slower than GS collections (the same too-big-fill-factor problem mentioned above).
  2. The other collections behave similarly to each other up to map size = 1M. After that, GS collections pulls ahead of the rest, followed by FastUtil.

Object-Object

tests.maptests.object.FastUtilObjMapTest Object[] keys, Object[] values, boolean[] used
tests.maptests.object.GsObjMapTest Object[] table - interleaved keys and values
tests.maptests.object.HftcMutableObjTest Object[] tab - interleaved keys and values
tests.maptests.object.HppcObjMapTest Object[] keys, Object[] values, boolean[] allocated
tests.maptests.object.JdkMapTest Node<K,V>[] table - each Node could be a part of a linked list or a TreeMap (Java 8)
tests.maptests.object.TroveObjMapTest Object[] _set, Object[] _values

In case of Object-to-Object mappings we have a more complex picture:

  • FastUtil and HPPC are using 3 arrays per map. Nothing fancy.
  • JDK HashMap is the only map which stores entries in Node objects combining a key and a value. It means you have at least 24 bytes of overhead per entry; in practice it is about 32 bytes, because each Node also caches the key's hash code and holds a link to the next entry in the bucket.
  • Trove is using 2 maps (and a special sentinel object for empty cells).
  • Finally, GS collections and Koloboke use a single array with interleaved keys and values, which makes them the most CPU cache friendly collections of these 6.

Now, armed with the implementation knowledge, let's look at the test results.

Object-Object test results

Map size 10000 100000 1000000 10000000 100000000
tests.maptests.object.HftcMutableObjTest 1146 1378 2928 6215 5945
tests.maptests.object.JdkMapTest 1151 1776 3759 5341 11523
tests.maptests.object.GsObjMapTest 1566 2242 4582 6012 8110
tests.maptests.object.FastUtilObjMapTest 1720 3002 6015 9360 13292
tests.maptests.object.HppcObjMapTest 1726 3085 5692 9125 13139
tests.maptests.object.TroveObjMapTest 2065 2979 5713 10266 12631

(Chart: Object-Object test results)

These test results are even less clear-cut.

  1. Koloboke is generally faster than JDK HashMap, but the difference is not that big, except for huge maps, where Koloboke wins.
  2. GS collections is close to Koloboke and JDK on large and huge maps, but noticeably behind on smaller maps.
  3. Finally, there are FastUtil, HPPC and Trove with approximately the same performance for all map sizes.

One billion entries test

I decided to see what happens to these collections if I try to create a map with a requested size of one billion entries and fill factor = 0.5, which means that all these maps have to allocate an array very close to the maximal allowed array length of 2^31.

FastUtil, HPPC and GS collections have failed with various exceptions (not OOM - I have allocated 110G RAM for this test).

Koloboke, Trove and JDK managed to pass these tests. Unfortunately, I did not manage to run some of these tests successfully in JMH, so they were run by separate code.

Here are the test results (if you want to compare them with the previous results, multiply the previous results by 10, because all previous tests called map.get 100M times in total):

tests.maptests.primitive.HftcMutableMapTest : time = 95.05 sec
tests.maptests.primitive.TroveMapTest : time = 235.062 sec

tests.maptests.prim_object.HftcIntObjectMapTest : time = 216.361 sec
tests.maptests.prim_object.TroveIntObjectMapTest : time = 304.019 sec

tests.maptests.object_prim.HftcObjectIntMapTest : time = 335.139 sec
tests.maptests.object_prim.TroveObjectIntMapTest : time = 217.412 sec

tests.maptests.object.HftcMutableObjTest : time = 272.792 sec
tests.maptests.object.JdkMapTest : time = 163.335 sec
tests.maptests.object.TroveObjMapTest : time = 239.133 sec

As you can see, Koloboke wins by a large margin in the primitive-to-primitive test. It is also significantly faster in the primitive-to-object test.

In the object-to-primitive test, Koloboke took significantly longer than Trove to complete.

Finally, for the object-to-object test I had to change the Koloboke map initialization code, because with default settings the map started to degrade extremely quickly once I had added half a billion elements into it:

HashObjObjMaps.getDefaultFactory().withHashConfig(HashConfig.fromLoads(0.5, 0.6, 0.8)).newMutableMap(keys.length)

Koloboke 2.0?

Roman Leventov has just announced that he is considering implementing a newer and even faster version of the Koloboke library, but he needs your feedback. Why not write him a line?

Summary

  • Koloboke has turned out to be the fastest and the most memory efficient library implementing hash maps. The library is still young and not widely used yet, but why not give it a try?
  • If you are looking for a more stable and mature library (and are willing to sacrifice some performance), you should probably look at the GS collections library. Unlike Koloboke, it gives you a wide range of collections out of the box.

Source code

The article source code is hosted at GitHub: https://github.com/mikvor/hashmapTest. You may expect the test set to be slightly ahead of this article :)

Please note that you should run this project via the tests.MapTestRunner class:

mvn clean install
java -cp target/benchmarks.jar tests.MapTestRunner

Introduction to JMH http://java-performance.info/jmh/ Sat, 13 Sep 2014 05:02:48 +0000

by Mikhail Vorontsov

11 Sep 2014: Article was updated for JMH 1.0.

10 May 2014: Original version.

Introduction

This article will give you an overview of basic rules and abilities of JMH. The second article will give you an overview of JMH profilers.

JMH is a new microbenchmarking framework (first released in late 2013). Its distinctive advantage over other frameworks is that it is developed by the same people at Oracle who implement the JIT. In particular I want to mention Aleksey Shipilev and his brilliant blog. JMH is likely to be in sync with the latest Oracle JRE changes, which makes its results very reliable.

You can find JMH examples here.

JMH has only 2 requirements (everything else is a recommendation):

  • You need to create a maven project using a command from the JMH official web page
  • You need to annotate test methods with @Benchmark annotation

In some cases, it is not convenient to create a new project just for the performance testing purposes. In this situation you can rather easily add JMH into an existing project. You need to make the following steps:

  1. Ensure your project directory structure is recognizable by Maven (your benchmarks are at src/main/java at least)
  2. Copy 2 JMH maven dependencies and maven-shade-plugin from the JMH archetype. No other plugins mentioned in the archetype are required at the moment of writing (JMH 1.0).

How to run

Run the following maven command to create a template JMH project from an archetype (it may change over time, check for the latest version near the start of the official JMH page):

$ mvn archetype:generate \
          -DinteractiveMode=false \
          -DarchetypeGroupId=org.openjdk.jmh \
          -DarchetypeArtifactId=jmh-java-benchmark-archetype \
          -DgroupId=org.sample \
          -DartifactId=test \
          -Dversion=1.0

Alternatively, copy 2 JMH dependencies and maven-shade-plugin from the JMH archetype (as described above).

Create one (or a few) java files. Annotate some methods in them with @Benchmark annotation – these would be your performance benchmarks.

You have at least 2 simple options to run your tests:

Follow the procedure from the official JMH page:

$ cd your_project_directory/
$ mvn clean install
$ java -jar target/benchmarks.jar

The last command should be entered verbatim – regardless of your project settings you will end up with a target/benchmarks.jar sufficient to run all your tests. This option has a slight disadvantage – it uses the default JMH settings for everything not provided via annotations (the @Fork, @Warmup and @Measurement annotations become nearly mandatory in this mode). Use the java -jar target/benchmarks.jar -h command to see all available command line options (there are plenty).

Or use the old way: add a main method to one of your classes and write a JMH start script inside it. Here is an example:

Options opt = new OptionsBuilder()
                .include(".*" + YourClass.class.getSimpleName() + ".*")
                .forks(1)
                .build();
new Runner(opt).run();

After that you can run it with target/benchmarks.jar as your classpath:

$ cd your_project_directory/
$ mvn clean install
$ java -cp target/benchmarks.jar your.test.ClassName

Now, after this extensive "how to run it" manual, let's look at the framework itself.


Test modes

You can use the following test modes specified using @BenchmarkMode annotation on the test methods:

Name Description
Mode.Throughput Calculate number of operations in a time unit.
Mode.AverageTime Calculate an average running time.
Mode.SampleTime Calculate how long it takes for a method to run (including percentiles).
Mode.SingleShotTime Just run a method once (useful for cold-testing mode). Or more than once if you have specified a batch size for your iterations (see @Measurement annotation below) – in this case JMH will calculate the batch running time (total time for all invocations in a batch).
Any set of these modes You can specify any set of these modes – the test will be run several times (depending on number of requested modes).
Mode.All All these modes one after another.

Time units

You can specify the time unit to use via @OutputTimeUnit, which takes an argument of the standard Java type java.util.concurrent.TimeUnit. Unfortunately, if you have specified several test modes for one test, the given time unit is used for all of them (for example, it may be convenient to measure SampleTime in nanoseconds, but throughput is better measured in longer time units). A combined example follows.
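
A small illustrative example combining both annotations – the class name and the method body are placeholders; the annotations are the point:

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;

public class ModeExample {
    @Benchmark
    @BenchmarkMode({ Mode.Throughput, Mode.AverageTime })  // run the test in 2 modes
    @OutputTimeUnit(TimeUnit.MICROSECONDS)  // one unit, unfortunately, for both modes
    public double measure() {
        return Math.log( System.nanoTime() );  // placeholder workload
    }
}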

State of test arguments

Your test methods can accept arguments. You can provide a single argument of a class which complies with the following 4 rules:

  • There should be a no-arg constructor (default constructor).
  • It should be a public class.
  • Inner classes should be static.
  • Class must be annotated with @State annotation.

@State annotation defines the scope in which an instance of a given class will be available. JMH allows you to run tests in multiple threads simultaneously, so choose the right state:

Name Description
Scope.Thread This is a default state. An instance will be allocated for each thread running the given test.
Scope.Benchmark An instance will be shared across all threads running the same test. Could be used to test multithreaded performance of a state object (or just mark your benchmark with this scope).
Scope.Group An instance will be allocated per thread group (see Groups section down below).

Besides marking a separate class as a @State, you can also mark your own benchmark class as @State. All above scope rules apply to this case as well.

State housekeeping

Like JUnit tests, you can annotate your state class methods with @Setup and @TearDown annotations (these methods are called fixtures in the JMH documentation). You can have any number of setup/teardown methods. These methods do not contribute anything to test times (but Level.Invocation may affect the precision of measurements).

You can specify when to call fixtures by providing a Level argument for @Setup/@TearDown annotations:

Name Description
Level.Trial This is the default level. Before/after the entire benchmark run (a group of iterations)
Level.Iteration Before/after an iteration (a group of invocations)
Level.Invocation Before/after every method call (this level is not recommended unless you know what you are doing)
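
A hedged sketch of a state class with fixtures – the class name, field and workload are illustrative:

import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;

@State(Scope.Thread)  // one instance per benchmark thread
public class ArrayState {
    int[] data;

    @Setup(Level.Trial)     // called once before the entire benchmark run
    public void setup() {
        data = new Random( 42 ).ints( 1000 ).toArray();
    }

    @TearDown(Level.Trial)  // called once after the entire benchmark run
    public void tearDown() {
        data = null;
    }

    @Benchmark
    public long sum() {
        long res = 0;
        for ( final int v : data )
            res += v;
        return res;  // return the result to defeat dead code elimination
    }
}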

Dead code

Dead code elimination is a well known problem among microbenchmark writers. The general solution is to use the result of calculations somehow. JMH does not do any magic tricks on its own. If you want to defend against dead code elimination – never write void tests. Always return the result of your calculations. JMH will take care of the rest.

If you need to return more than one value from your test, either combine all return values with some cheap operation (cheap compared to the cost of the operations by which you got your results) or use a BlackHole method argument and sink all your results into it (note that BlackHole.consume may be more expensive than manual combining of results in some cases). BlackHole is a thread-scoped class:

@Benchmark
public void testSomething( BlackHole bh )
{
    bh.consume( Math.sin( state_field ));
    bh.consume( Math.cos( state_field ));
}

Constant folding

If the result of your calculation is predictable and does not depend on state objects, it is likely to be optimized away by the JIT. So, always read the test input from a state object and return the result of your calculations. This rule mostly applies to the case of a single return value. Using a BlackHole object makes it much harder for the JVM to optimize the code away (but not impossible!). Neither method in the following test will be optimized:

private double x = Math.PI;

@Benchmark
public void bhNotQuiteRight( BlackHole bh )
{
    bh.consume( Math.sin( Math.PI ));
    bh.consume( Math.cos( Math.PI ));
}

@Benchmark
public void bhRight( BlackHole bh )
{
    bh.consume( Math.sin( x ));
    bh.consume( Math.cos( x ));
}

Things get more complicated in the case of a method returning a single value. The following tests will not be optimized, but if you replace Math.sin with Math.log, the testWrong method will be reduced to a constant value:

private double x = Math.PI;

@Benchmark
public double testWrong()
{
    return Math.sin( Math.PI );
}

@Benchmark
public double testRight()
{
    return Math.sin( x );
}

So, in order to make your tests reliable, stick to the following rule: always read the test input from a state object and return the result of your calculations.

Loops


Do not use loops in your tests. The JIT is too smart and often does magic tricks with loops. Test the actual calculation and let JMH take care of the rest.

In the case of non-uniform cost operations (for example, you test the time to process a list which grows after each test) you may want to use @BenchmarkMode(Mode.SingleShotTime) with @Measurement(batchSize = N), as sketched below. But you must not implement test loops yourself!
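
A hedged sketch of such a batched single-shot setup – the class name, the workload and the iteration/batch counts are arbitrary:

import java.util.ArrayList;
import java.util.List;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

@State(Scope.Thread)
public class GrowingListBenchmark {
    private List<Integer> list;

    @Setup(Level.Iteration)  // start every iteration with an empty list
    public void setup() {
        list = new ArrayList<>();
    }

    @Benchmark
    @BenchmarkMode(Mode.SingleShotTime)
    @Warmup(iterations = 5, batchSize = 5000)
    @Measurement(iterations = 5, batchSize = 5000)  // 5000 invocations per iteration
    public List<Integer> addToHead() {
        list.add( 0, 42 );  // cost grows with list size - JMH times the whole batch
        return list;
    }
}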

Forks

By default JMH forks a new java process for each trial (set of iterations). This is required to defend the test from previously collected "profiles" – information about other loaded classes and their execution. For example, if you have 2 classes implementing the same interface and you test the performance of both of them, then the first implementation (in order of testing) is likely to be faster than the second one (in the same JVM), because the JIT replaces direct method calls to the first implementation with interface method calls once it discovers the second implementation.

So, do not set the number of forks to zero unless you know what you are doing.

In the rare cases when you need to specify the number of forked JVMs, use the @Fork test method annotation, which allows you to set the number of forks, the number of warmup iterations and the (extra) arguments for the forked JVM(s).

It may be useful to specify the forked JVM arguments via JMH API calls – this lets you provide the JVM some -XX: arguments, which are not accessible via JMH annotations. That way you can automatically choose the best JVM settings for your critical code (remember that new Runner(opt).run() returns all test results in a convenient form). A sketch follows.
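
A hedged sketch of passing extra -XX: flags through the API – the class name, the include pattern and the chosen flag are arbitrary:

import java.util.Collection;
import org.openjdk.jmh.results.RunResult;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class ForkedJvmArgs {
    public static void main( final String[] args ) throws RunnerException {
        final Options opt = new OptionsBuilder()
                .include( ".*YourBenchmark.*" )
                .forks( 1 )
                .jvmArgsAppend( "-XX:MaxInlineSize=100" )  // extra flag for the forked JVM
                .build();
        final Collection<RunResult> results = new Runner( opt ).run();
        // compare the returned results programmatically to pick the best JVM settings
        System.out.println( "Finished " + results.size() + " benchmark(s)" );
    }
}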

Compiler hints

You can give the JIT a hint on how to treat any method in your test program. By "any method" I mean any method – not just those annotated with @Benchmark. You can use the following @CompilerControl modes (there are more, but I am not sure about their usefulness):

Name Description
CompilerControl.Mode.DONT_INLINE This method should not be inlined. Useful to measure the method call cost and to evaluate whether it is worth increasing the inline threshold for the JVM.
CompilerControl.Mode.INLINE Ask the compiler to inline this method. Usually should be used in conjunction with Mode.DONT_INLINE to check pros and cons of inlining.
CompilerControl.Mode.EXCLUDE Do not compile this method – interpret it instead. Useful in holy wars as an argument for how good the JIT is :)

Test control annotations

You can specify JMH parameters via annotations. These annotations could be applied to either classes or methods. Method annotations always win.

Name Description
@Fork Number of trials (sets of iterations) to run. Each trial is started in a separate JVM. It also lets you specify the (extra) JVM arguments.
@Measurement Allows you to provide the actual test phase parameters. You can specify number of iterations, how long to run each iteration and number of test invocations in the iteration (usually used with @BenchmarkMode(Mode.SingleShotTime) to measure the cost of a group of operations – instead of using loops).
@Warmup Same as @Measurement, but for warmup phase.
@Threads Number of threads to use for the test. The default is Runtime.getRuntime().availableProcessors().

CPU burning

From time to time you may want to burn some CPU cycles inside your tests. This can be done via the static BlackHole.consumeCPU(tokens) method. A token is a few CPU instructions. The method code is written so that its running time depends linearly on its argument (defensive against any JIT/CPU optimizations).

Running a test with a set of parameters

In many situations you need to test your code with several sets of parameters. Luckily, JMH does not force you to write N test methods if you need to test N sets of parameters. Or, to be more precise, JMH will help you if your test parameters are primitives, primitive wrappers or Strings.

All you need to do is:

  1. Define a @State object
  2. Define all your parameters fields in it
  3. Annotate each of these fields with @Param annotation

The @Param annotation expects an array of String arguments. These strings will be converted to the field type before any @Setup method invocations. Nevertheless, the JMH documentation claims that these field values may not be accessible in @Setup methods.

JMH will use the outer product of all @Param fields. So, if you have 2 values for the first field and 5 values for the second field, your test will be executed 2 * 5 * Forks times – see the sketch below.
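
A hedged @Param sketch – field names and values are illustrative; this state produces 2 * 5 = 10 parameter combinations:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class ParamExample {
    @Param({ "10", "1000" })  // 2 values, converted from String to int by JMH
    int size;

    @Param({ "0.3", "0.5", "0.75", "0.9", "1.0" })  // 5 values
    double fillFactor;

    int[] data;

    @Setup
    public void setup() {
        data = new int[ size ];
    }

    @Benchmark
    public int touch() {
        return data.length + (int) ( fillFactor * 100 );  // read params via state fields
    }
}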

Thread groups – non uniform multithreading

We have already mentioned that the @State(Scope.Benchmark) annotation can be used to test multithreaded access to a state object. The degree of concurrency is set by the number of threads used for testing.

You may also need to define the non-uniform access to your state object – for example to test the “readers-writers” scenario where the number of readers is usually higher than the number of writers. JMH uses the notion of thread groups for this case.

In order to setup a group of tests, you need:

  1. Mark all your test methods with @Group(name) annotation, providing the same string name for all tests in a group (otherwise these tests will be run independently – no warning will be given!).
  2. Annotate each of your tests with @GroupThreads(threadsNumber) annotation, specifying a number of threads which will run the given method.

JMH will start the sum of all your @GroupThreads for the given group and will run all tests in the group concurrently in the same trial. The results will be reported for the group as a whole and for each method independently – see the sketch below.
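
A hedged readers-writers sketch – 3 readers and 1 writer per group; the class name and the shared counter are illustrative:

import java.util.concurrent.atomic.AtomicLong;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Group;
import org.openjdk.jmh.annotations.GroupThreads;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Group)  // one instance shared by all threads of a group
public class ReadersWritersBenchmark {
    private final AtomicLong counter = new AtomicLong();

    @Benchmark
    @Group("rw")
    @GroupThreads(3)  // 3 reader threads
    public long reader() {
        return counter.get();
    }

    @Benchmark
    @Group("rw")
    @GroupThreads(1)  // 1 writer thread; the group occupies 4 threads in total
    public long writer() {
        return counter.incrementAndGet();
    }
}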

Multithreading – False shared field access

You probably know that most modern x86 CPUs have 64 byte cache lines. The CPU cache allows you to read data at great rates, but at the same time it creates a performance bottleneck if you have to read and write 2 adjacent fields from 2 or more threads at the same time. Such an event is called "false sharing" – while the fields seem to be accessed independently, they actually contend with each other on the hardware level.

The general solution to this problem is to pad such fields with at least 128 bytes of dummy data on both sides. Padding inside the same class may not work properly, because the JVM is allowed to reorder class fields in any order.

The more robust solution is to use class hierarchies – the JVM usually puts all fields belonging to the same class together. For example, we can define class A with a read access field, extend it with a class B defining 16 long fields, extend class B with a class C defining the write access field and finally (that's important) extend class C with a class D defining another 16 long fields – this prevents contended access to the write variable from whatever object is located next in memory. A sketch follows this paragraph.
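
A sketch of that hierarchy – the field counts follow the text (16 longs = 128 bytes of padding); the class and field names are made up:

class A            { long readField; }
class B extends A  { long p01, p02, p03, p04, p05, p06, p07, p08,
                     p09, p10, p11, p12, p13, p14, p15, p16; }  // 128 bytes of padding
class C extends B  { long writeField; }
class D extends C  { long q01, q02, q03, q04, q05, q06, q07, q08,
                     q09, q10, q11, q12, q13, q14, q15, q16; }  // trailing padding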

In the case when the read and write fields have the same type, you can also use a sparse array with 2 cells located far enough from each other. Do not use arrays as padding in the previous case – they are a special type of object and will contribute only 4 or 8 bytes (depending on your JVM settings) to the padding.

There is another way to solve this problem if you are already using Java 8: use the @sun.misc.Contended annotation for write fields and the -XX:-RestrictContended JVM key. For more details, take a look at Aleksey Shipilev's presentation.

How can JMH help you with contended field access? It pads your @State objects from both sides, but it can not help you pad individual fields inside a single object – that is left to you.

Summary

  • JMH is useful for all sorts of microbenchmarking – from nanoseconds to seconds per test. It takes care of all the measurement logic, leaving you just the task of writing the test method(s). JMH also contains built-in support for all sorts of multithreaded tests – both uniform (all threads run the same code) and non-uniform (there are several groups of threads, each of them running its own code).
  • If you have to remember just one JMH rule, it should be: always read test input from @State objects and return the result of your calculations (either explicitly or via a BlackHole object).
  • JMH is started differently since JMH 0.5: now you have to add one more dependency to your pom file and use maven-shade-plugin. It generates the target/benchmarks.jar file, which contains all the code required to run all the tests in your project.

String deduplication feature (from Java 8 update 20) http://java-performance.info/java-string-deduplication/ Wed, 03 Sep 2014 09:31:23 +0000

by Mikhail Vorontsov

This article will provide you with a short overview of the string deduplication feature added in Java 8 update 20.

String objects consume a large amount of memory in an average application. Some of these strings may be duplicated – there exist several distinct instances of the same String (a != b, but a.equals(b)). In practice, a lot of Strings could be duplicated due to various reasons.

Originally, the JDK offered the String.intern() method to deal with string duplication. The disadvantage of this method is that you have to find which strings should be interned. This generally requires a heap analysis tool with a duplicate string lookup ability, like the YourKit profiler. Nevertheless, if used properly, string interning is a powerful memory saving tool – it allows you to reuse whole String objects (each of which adds 24 bytes of overhead on top of the underlying char[]).

Starting from Java 7 update 6, each String object has its own private underlying char[]. This allows the JVM to make an automatic optimization – if an underlying char[] is never exposed to the client, then the JVM can find 2 strings with the same contents and replace the underlying char[] of one string with the underlying char[] of the other.

That's what the string deduplication feature added in Java 8 update 20 does. How it works:


  1. You need to use G1 garbage collector and turn this feature on: -XX:+UseG1GC -XX:+UseStringDeduplication. This feature is implemented as an optional step of G1 garbage collector and not available if you are using any other garbage collector.
  2. This feature may be executed during a minor GC of the G1 collector. In my observations, it depends on the availability of spare CPU cycles. So don't expect it to work in a data cruncher which keeps all CPUs busy. On the other hand, a web server is likely to execute it very often.
  3. String deduplication looks for not yet processed strings, calculates their hash codes (if not calculated before by the application code) and then checks if there are any other strings with the same hash code and an equal underlying char[]. If one is found, it replaces the new string's char[] with the existing string's char[].
  4. String deduplication processes only strings which have survived a few garbage collections. This ensures that the majority of very short living strings will not be processed. The minimal string age is managed by the -XX:StringDeduplicationAgeThreshold=3 JVM parameter (3 is the default value of this parameter).

There are several important consequences of this implementation:

  • Yes, you need to use the G1 collector if you want the free lunch of string deduplication. You can't use it with the parallel GC, which is generally a better choice for applications favoring throughput over latency.
  • String deduplication is unlikely to run on a loaded system. To check if it is invoked, run the JVM with the -XX:+PrintStringDeduplicationStatistics option and look at the console output.
  • If you need to save memory and you can intern strings in your application – do it, don't rely on string deduplication. Keep in mind that string deduplication will process all (or at least most of) your strings – even if you know that a given variable's contents are unique (a GUID, for example), the JVM does not know it and will try to match the string against the other strings. As a result, the CPU cost of string deduplication depends both on the number of strings in the heap (a new string will be compared with some of them) and the number of strings you create between deduplication runs (these strings will be compared to the heap strings). Use the -XX:+PrintStringDeduplicationStatistics JVM option on multi-gigabyte heaps to check the impact of this feature.
  • On the other hand, this is done in a mostly non-blocking fashion, so if your server has enough spare CPU capacity, why not use it? :)
  • Finally, remember that String.intern allows you to target only the subset of strings in your application which is known to contain duplicates. Generally that means a smaller pool of interned strings to compare against, which means you can use your CPU more efficiently. Besides, it allows you to intern full String objects, thus saving an extra 24 bytes per string.

Here is a test class I have used to experiment with this feature. Each of the 3 tests runs until the JVM throws an OOM, so you have to run them separately.

The first test creates strings with unique content; it is nevertheless useful if you want to estimate how long it takes to deduplicate strings when there is a huge number of them in the heap. Give the first test as much heap as you can – the more strings it creates, the better.

The second and the third tests compare deduplication (second test) and interning (third test). You need to run them with an identical Xmx setting. I have tuned the constants in the program for Xmx256M, but you can allocate more. Either way, you will see that the deduplication test fails after fewer iterations than the interning test. Why? Because we have only 100 distinct strings in these tests, so interning them means that the only memory you need is the list where those strings are stored. Deduplication, on the other hand, leaves distinct String objects, canonicalizing only the underlying char[].

/**
 * String deduplication vs interning test
 */
public class StringDedupTest {
    private static final int MAX_EXPECTED_ITERS = 300;
    private static final int FULL_ITER_SIZE = 100 * 1000;
 
    //30M entries = 120M RAM (for 300 iters)
    private static List<String> LIST = new ArrayList<>( MAX_EXPECTED_ITERS * FULL_ITER_SIZE );
 
    public static void main(String[] args) throws InterruptedException {
        //24+24 bytes per String (24 String shallow, 24 char[])
        //136M left for Strings
 
        //Unique, dedup
        //136M / 2.9M strings = 48 bytes (exactly String size)
 
        //Non unique, dedup
        //4.9M Strings, 100 char[]
        //136M / 4.9M strings = 27.75 bytes (close to 24 bytes per String + small overhead)
 
        //Non unique, intern
        //We use 120M (+small overhead for 100 strings) until very late, but can't extend ArrayList 3 times - we don't have 360M
 
        /*
          Run it with: -XX:+UseG1GC -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics
          Give as much Xmx as you can on your box. This test will show you how long does it take to
          run a single deduplication and if it is run at all.
          To test when deduplication is run, try changing a parameter of Thread.sleep or comment it out.
          You may want to print garbage collection information using -XX:+PrintGCDetails -XX:+PrintGCTimestamps
        */
 
        //Xmx256M - 29 iterations
        fillUnique();
 
        /*
         This couple of tests compare string deduplication (first test) with string interning.
         Both tests should be run with the identical Xmx setting. I have tuned the constants in the program
         for Xmx256M, but any higher value is also good enough.
         The point of this tests is to show that string deduplication still leaves you with distinct String
         objects, each of those requiring 24 bytes. Interning, on the other hand, return you existing String
         objects, so the only memory you spend is for the LIST object.
         */
 
        //Xmx256M - 49 iterations (100 unique strings)
        //fillNonUnique( false );
 
        //Xmx256M - 299 iterations (100 unique strings)
        //fillNonUnique( true );
    }
 
    private static void fillUnique() throws InterruptedException {
        int iters = 0;
        final UniqueStringGenerator gen = new UniqueStringGenerator();
        while ( true )
        {
            for ( int i = 0; i < FULL_ITER_SIZE; ++i )
                LIST.add( gen.nextUnique() );
            Thread.sleep( 300 );
            System.out.println( "Iteration " + (iters++) + " finished" );
        }
    }
 
    private static void fillNonUnique( final boolean intern ) throws InterruptedException {
        int iters = 0;
        final UniqueStringGenerator gen = new UniqueStringGenerator();
        while ( true )
        {
            for ( int i = 0; i < FULL_ITER_SIZE; ++i )
                LIST.add( intern ? gen.nextNonUnique().intern() : gen.nextNonUnique() );
            Thread.sleep( 300 );
            System.out.println( "Iteration " + (iters++) + " finished" );
        }
    }
 
    private static class UniqueStringGenerator
    {
        private char upper = 0;
        private char lower = 0;
 
        public String nextUnique()
        {
            final String res = String.valueOf( upper ) + lower;
            if ( lower < Character.MAX_VALUE )
                lower++;
            else
            {
                upper++;
                lower = 0;
            }
            return res;
        }
 
        public String nextNonUnique()
        {
            final String res = "a" + lower;
            if ( lower < 100 )
                lower++;
            else
                lower = 0;
            return res;
        }
    }
}

See also

JEP 192 - a formal description of String deduplication

Summary

  • String deduplication feature was added in Java 8 update 20. It is a part of G1 garbage collector, so it should be turned on with G1 collector: -XX:+UseG1GC -XX:+UseStringDeduplication
  • String deduplication is an optional G1 phase. It depends on the current system load.
  • String deduplication looks for strings with the same contents and canonicalizes the underlying char[] with the string characters. You don't need to write any code to use this feature, but it means you are left with distinct String objects, each of them occupying 24 bytes. Sometimes it is worth interning strings explicitly via String.intern.
  • String deduplication does not process very young strings. The minimal age of processed strings is managed by the -XX:StringDeduplicationAgeThreshold=3 JVM parameter (3 is the default value of this parameter).
