Charset encoding and decoding in Java 7/8

by Mikhail Vorontsov

We will take a look at Charset encoder and decoder performance in Java 7/8. We will check the performance of the following 2 String methods with various charsets:

/* String to byte[] */
public byte[] getBytes(Charset charset);
/* byte[] to String */
public String(byte bytes[], Charset charset);

I have translated the phrase “Develop with pleasure” into German, Russian, Japanese and traditional Chinese using Google Translate. We will build chunks of a given size from these phrases by concatenating them with “\n” as a separator until we reach the required length (in most cases the result will be slightly longer). After that we will convert 100 million characters of such data to byte[] and back to String (100M is the total data length in Java chars). We will run each conversion 10 times in order to make the results more reliable (as a result, the times in the following table are the times to convert 1 billion chars).

We will use 2 chunk sizes: 100 chars to test the performance of short string conversion and 100M chars to test the raw conversion performance. You can find the source code at the end of this article. We will compare encoding into a “national” charset (US-ASCII for English; ISO-8859-1 for German; windows-1251 for Russian; Shift_JIS for Japanese; GB18030 for Traditional Chinese) with UTF-8 encoding, which could be used as a universal encoding (at the possible price of a longer binary representation). We will also compare Java 7u51 with Java 8 (original release). All tests were run on my Xeon-2650 (2.8 GHz) workstation with the -Xmx32G setting (to avoid GC).
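The measurement loop boils down to the following sketch (simplified: the class name is made up, and the phrase, chunk size and charset are parameters in the real benchmark found at the end of this article):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    public static void main(String[] args) {
        final String phrase = "Develop with pleasure"; // or one of the translated phrases
        final int chunkSize = 100; // 100 or 100M chars
        // build a chunk of at least chunkSize chars, separated by '\n'
        final StringBuilder sb = new StringBuilder(chunkSize + phrase.length() + 1);
        while (sb.length() < chunkSize)
            sb.append(phrase).append('\n');
        final String chunk = sb.toString();

        final Charset charset = StandardCharsets.UTF_8; // or Charset.forName("Shift_JIS"), etc.
        final long start = System.currentTimeMillis();
        for (long converted = 0; converted < 100_000_000L; ) { // 100M chars per pass
            final byte[] bytes = chunk.getBytes(charset);       // String -> byte[]
            final String restored = new String(bytes, charset); // byte[] -> String
            converted += restored.length();
        }
        System.out.println("Time = " + (System.currentTimeMillis() - start) / 1000.0 + " sec");
    }
}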

Here are the test results. Each cell contains two times: Java7_time (Java8_time). The “UTF-8” line which follows every “national” charset line contains conversion times for the data from the previous line (for example, the last line contains the times to encode/decode a traditional Chinese string to/from UTF-8).

Continue reading


Base64 encoding and decoding performance

by Mikhail Vorontsov

02 Apr 2014 update: added Guava implementation and byte[] <-> byte[] section.

21 Mar 2014 update: major rewrite + added javax.xml.bind.DatatypeConverter class description.

21 Feb 2014 update: added MiGBase64 class description.

25 Dec 2013 update: added Java 8 java.util.Base64 class description.

We will discuss what the Base64 algorithm is and compare the performance of several well-known libraries implementing Base64 encoding/decoding.

Base64 is an algorithm mapping arbitrary binary data onto an alphabet of 64 printable US-ASCII characters. This is done by packing 3 input bytes into 4 output bytes, each carrying 6 bits of input. Base64 is generally used in text-based data exchange protocols when there is still a need to transfer some binary data. The best known example is the encoding of e-mail attachments.
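For illustration, here is how one full 3-byte group is packed into 4 output characters (a sketch of the core step only – the padding of a trailing 1- or 2-byte group is omitted):

// the 64-character Base64 alphabet
private static final char[] ALPHABET =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".toCharArray();

/** Packs 3 input bytes (24 bits) into 4 six-bit alphabet indices. */
static char[] encodeGroup(final byte b0, final byte b1, final byte b2) {
    final int bits = ((b0 & 0xFF) << 16) | ((b1 & 0xFF) << 8) | (b2 & 0xFF);
    return new char[] {
        ALPHABET[(bits >>> 18) & 0x3F],
        ALPHABET[(bits >>> 12) & 0x3F],
        ALPHABET[(bits >>> 6)  & 0x3F],
        ALPHABET[ bits         & 0x3F]
    };
}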

JDK Base64 implementations

Surprisingly, there was no public Base64 implementation in the core JDK classes before Java 6. Some web forums advise using two non-public sun.* classes which are present in all Sun/Oracle JDKs: sun.misc.BASE64Encoder and sun.misc.BASE64Decoder. The advantage of using them is that you don’t need to ship any other libraries with your application. The disadvantage is that those classes are not supposed to be used outside the JDK (and, of course, they could be removed from the JDK implementation… in theory, at least).

Sun added another Base64 implementation in Java 6 (thanks to Thomas Darimont for his reminder!): it was hidden in the javax.xml.bind package and was unknown to many developers. The javax.xml.bind.DatatypeConverter class has 2 static methods – parseBase64Binary and printBase64Binary – which are used for Base64 decoding and encoding.
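Usage is straightforward (a minimal example):

import java.nio.charset.StandardCharsets;
import javax.xml.bind.DatatypeConverter;

final byte[] data = "Develop with pleasure".getBytes(StandardCharsets.US_ASCII);
final String encoded = DatatypeConverter.printBase64Binary(data);   // byte[] -> Base64 String
final byte[] decoded = DatatypeConverter.parseBase64Binary(encoded); // Base64 String -> byte[]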

Java 8 has finally added a Base64 implementation in the java.* namespace – java.util.Base64. This static factory class provides basic, MIME, and URL/filename-safe encoder and decoder implementations.
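A minimal example of the Java 8 API:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

final byte[] data = "Develop with pleasure".getBytes(StandardCharsets.US_ASCII);

// basic encoder/decoder
final String basic = Base64.getEncoder().encodeToString(data);
final byte[] back = Base64.getDecoder().decode(basic);

// MIME flavour (inserts line separators into long output)
final String mime = Base64.getMimeEncoder().encodeToString(data);

// URL and filename safe flavour ('-' and '_' instead of '+' and '/')
final String url = Base64.getUrlEncoder().encodeToString(data);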

Surprisingly (or maybe not), these implementations do not share any logic, even in Java 8.

Third party Base64 implementations

I will also mention 4 quite well-known third party Base64 implementations (a short usage sketch for the first two follows the list).

  • The first one is part of the Apache Commons Codec library and is called org.apache.commons.codec.binary.Base64.
  • The next one is part of the Google Guava library and is accessible via the com.google.common.io.BaseEncoding.base64() static method.
  • Another one was written by Robert Harder and is available from his website: http://iharder.net/base64. This is a single class which you will have to add to your project.
  • The last one was written by Mikael Grev nearly 10 years ago. It is available from http://migbase64.sourceforge.net/. This is also a single class you have to add to your project. This implementation claims to be the fastest Base64 implementation, but unfortunately that is no longer true. Besides, it has the strictest limit on the maximum length of byte[] to decode (see below).
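Here is how encoding/decoding looks in the first two libraries (a sketch: these method names assume Commons Codec 1.4+ and Guava 14+, where BaseEncoding was introduced):

import java.nio.charset.StandardCharsets;
import org.apache.commons.codec.binary.Base64;
import com.google.common.io.BaseEncoding;

final byte[] data = "Develop with pleasure".getBytes(StandardCharsets.US_ASCII);

// Apache Commons Codec
final String commons = Base64.encodeBase64String(data);
final byte[] fromCommons = Base64.decodeBase64(commons);

// Google Guava
final String guava = BaseEncoding.base64().encode(data);
final byte[] fromGuava = BaseEncoding.base64().decode(guava);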

Continue reading


Implementing a high performance Money class

by Mikhail Vorontsov

This article will discuss how to efficiently implement a Money class in Java. It follows the Using double/long vs BigDecimal for monetary calculations article published earlier in this blog.

Introduction

As I have written before, we should not use any floating point types for storing monetary values. We should use long instead, keeping the number of smallest currency units in it. Unfortunately, there are a couple of problems:

  • We may still have to communicate with software storing monetary values in double variables.
  • We may want to multiply it by a non-integer number, like 0.015, or divide it by any kind of number, thus getting a non-integer result.

The easy way to deal with arithmetic operations is to use the BigDecimal class for all of them. Unfortunately, BigDecimal will not help you deal with already incorrect results of double calculations. For example, the following code will print 0.7999999999999999:

System.out.println( new BigDecimal( 0.7 + 0.1, MathContext.DECIMAL64 ) );

So we may say that BigDecimal operations produce a correct result if they are given correct arguments. We cannot say the same about double operations.

double operations may end up with the exact result or with a result which is one bit off the exact result. This is most likely caused by the CPU executing double operations at a higher precision and then rounding the result. All we need to do to get the exact result is to look at 3 values: the result itself; the result plus one unit in the last place (ulp); and the result minus one ulp. One of them will be the correct one. Easy to say, difficult to implement efficiently :)
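A small illustration of the idea (this is just the brute-force view of the 3 candidates, not the efficient implementation): Math.ulp gives the distance to the neighbouring double values, and for the 0.7 + 0.1 example above, the “result plus one ulp” candidate is the one matching the exact decimal result 0.8.

import java.math.BigDecimal;
import java.math.MathContext;

final double r = 0.7 + 0.1;     // one bit below the exact decimal result
final double ulp = Math.ulp(r); // distance to the neighbouring doubles

System.out.println(new BigDecimal(r, MathContext.DECIMAL64));       // 0.7999999999999999
System.out.println(new BigDecimal(r + ulp, MathContext.DECIMAL64)); // 0.8000000000000000 - the exact one
System.out.println(new BigDecimal(r - ulp, MathContext.DECIMAL64)); // 0.7999999999999998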

Continue reading


JSR 310 – Java 8 Date/Time library performance (as well as Joda Time 2.3 and j.u.Calendar)

by Mikhail Vorontsov

Introduction

This is the third date/time article in this blog. I advise you to look at the other two as well: java.util.Date, java.util.Calendar and java.text.SimpleDateFormat and Joda Time library performance.

This article is a short overview of the new Java 8 date/time implementation, also known as JSR-310. I will compare the JSR-310 classes and their performance with their counterparts from the Joda Time library as well as with the good old java.util.GregorianCalendar. This review was written and tested on Java 8 ea b121.

All new Java 8 classes are implemented around the human time concept – separate fields for years, months, days, hours, minutes, seconds and, in line with the current fashion, nanoseconds. Their counterpart is machine time – the number of milliseconds since epoch, which you may obtain, for example, via a System.currentTimeMillis() call. In order to convert between these 2 systems you need to know which timezone to use. A timezone defines the offset from UTC used in conversions. Offset calculation may require the use of a transition table or transition rules defining when daylight savings changes happen. Sometimes it may become a performance bottleneck.
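In Java 8 terms, the two conversions look like this (a sketch assuming the final java.time API names; the zone id is an arbitrary example):

import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// machine time -> human time: requires a timezone offset lookup
final ZonedDateTime human = Instant.ofEpochMilli(System.currentTimeMillis())
                                   .atZone(ZoneId.of("Australia/Sydney"));

// human time -> machine time: the same offset lookup in reverse
final long millis = human.toInstant().toEpochMilli();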

The JSR-310 implementation was inspired by the Joda Time library – both libraries have a similar interface, but their implementations differ greatly: Java 8 classes are built around human time, while Joda Time uses machine time inside. As a result, if you are looking for the fastest implementation, I would recommend Java 8 classes, except in the following situations:

  • You can’t use Java 8 (yeah, not many people can use it before the first official release…)
  • You work strictly with machine time inside a few-day range (in this case a manual long/int based implementation will be faster).
  • You have to parse timestamps including an offset/zone id.

Continue reading


Joda Time library performance

by Mikhail Vorontsov

The Joda Time library is an attempt to create a cleaner and more developer-friendly date/time library than the existing JDK Date/Calendar/DateFormat ecosystem. This article provides a top-level overview of the Joda Time library from a performance point of view. It may also be useful as a short introduction to Joda Time. I recommend reading the java.util.Date, java.util.Calendar and java.text.SimpleDateFormat article before this one in order to better understand the performance of the built-in JDK classes.

This article discusses Joda Time library versions 2.1 – 2.3.


26 Jan 2014: This is a major article rewrite – a major performance issue was found in Joda Time implementation.

Date/time storage classes

There are five general purpose date/time classes in this library:

  • DateTime – full replacement of java.util.Calendar, supporting time zones and date/time arithmetic. This class is immutable.
  • MutableDateTime – mutable version of DateTime.
  • LocalDate – immutable version of DateTime containing only date fields. This class does not use a timezone after construction.
  • LocalTime – immutable version of DateTime containing only time fields. This class does not use a timezone after construction.
  • LocalDateTime – immutable version of DateTime not using a timezone.

All these classes contain 2 mandatory fields – a long with millis since epoch and a Chronology object containing all the timezone and calendar system (Gregorian, Buddhist, etc.) related logic. Some of these classes contain 1-2 more fields. An instance of these classes occupies from 24 to 40 bytes.

Since these classes are based on “machine” time – millis since epoch – we may expect better performance on from/to long conversions and worse performance on to/from date component (human time) conversions. In reality, Joda developers have made a series of clever optimizations which make it really fast even for human time calculations.
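For example (a sketch: constructing from human-time components triggers a timezone offset lookup, while the millis-based accessor just reads the stored long):

import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;
import org.joda.time.LocalDate;

// construction from date/time components: human time -> millis conversion
final DateTime dt = new DateTime(2013, 1, 15, 10, 30, 0, 0,
        DateTimeZone.forID("Australia/Sydney"));
final long millis = dt.getMillis(); // cheap: the field is stored directly
final int year = dt.getYear();      // human-time field: computed from millis

// date-only class: the timezone is used only during construction
final LocalDate date = new LocalDate(2013, 1, 15);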

Joda Time timezone offset calculation performance bug (ver 2.3)

Let’s start this updated version of the article from this point :) When I wrote the original version of this article in late 2012, I noticed the poor performance of timezone-based logic in Joda Time, but I thought it was a common property of the library, so I simply wrote that “this logic is definitely worth optimizing in Joda Time”.

One year later I wrote a more comprehensive set of tests which included the new Java 8 date/time implementation (JSR-310). This test set highlighted some weird inconsistencies between the various date/time implementations. In particular, I noticed that creating a Joda MutableDateTime from date components was 3-4 times faster than the same operation on date+time components, even though both tests used identical client logic. But there was one difference – the date tests were using years from 1981 to 2000 (in a loop), while the DateTime tests were using 2013. This turned out to be the key to the problem.

MutableDateTime, as well as some other Joda Time classes, calculates the timezone offset for a given “millis since epoch” value from time to time. It may be calculated more than once per client API call. Deep under the hood there is an org.joda.time.tz.DateTimeZoneBuilder$PrecalculatedZone class with a getOffset method. This method looks up the transition table for the given timezone using binary search. If your timestamp is less than or equal to the biggest timestamp in the table – you get your offset. Otherwise the org.joda.time.tz.DateTimeZoneBuilder$DSTZone.getOffset method is called for every offset you calculate. It uses the daylight savings transition rules to calculate the latest transition and uses it for the offset calculation. Calculated values are not cached on this branch.

I noticed this difference between years 2008 and 2009 in the “Australia/Sydney” timezone. After that I ran the same test on all available timezones and found a list of zones in/around Australia and New Zealand with the same performance issue – offsets in 2009 were calculated much more slowly than in 2008. At the same time I noticed that European timezones were slow in both 2008 and 2009. This led me to the root cause.

Joda Time ships with timezone rule source files in the src/main/java/org/joda/time/tz/src directory. If you take a look at the “australasia” file and find the “New South Wales” rule, you will see that its last 2 lines are:

Rule	AN	2008	max	-	Apr	Sun>=1	2:00s	0	-
Rule	AN	2008	max	-	Oct	Sun>=1	2:00s	1:00	-

This makes 2008 the last fast year (and I used the 1st of January for testing). After that I got more suspicious, opened the “europe” file and got really worried. For example, the last rule for the Netherlands (Europe/Amsterdam) dates back to 1945:

Rule	Neth	1945	only	-	Apr	 2	2:00s	1:00	S
Rule	Neth	1945	only	-	Sep	16	2:00s	0	-
    

The longer a country lives with a stable daylight saving rule, the more it gets penalized by Joda Time: it takes ~3.6 seconds to create 10M MutableDateTime objects for year 2008 in the “Europe/Amsterdam” timezone and ~3 seconds to create 10M MutableDateTime objects for year 2010 in the “Australia/Sydney” timezone (which also has to calculate its daylight savings transitions), but only ~1.2 seconds for year 2008 in Sydney (precalculated table).
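The effect is easy to reproduce with a loop like the following (a rough sketch of my test: create10M is a made-up helper name, and absolute times will of course differ between machines):

import org.joda.time.DateTimeZone;
import org.joda.time.MutableDateTime;

static long create10M(final int year, final DateTimeZone zone) {
    final long start = System.currentTimeMillis();
    for (int i = 0; i < 10_000_000; i++) {
        // each construction needs a timezone offset for the given instant
        new MutableDateTime(year, 1, 1, 0, 0, 0, 0, zone);
    }
    return System.currentTimeMillis() - start;
}

// create10M(2008, DateTimeZone.forID("Australia/Sydney")) - precalculated table, fast
// create10M(2010, DateTimeZone.forID("Australia/Sydney")) - DST rules branch, ~2.5 times slower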


So, I would like to ask the Joda Time maintainers to consider prebuilding such transition tables during timezone construction (at runtime) up to at least the current year plus a few more years.

Continue reading


java.util.Random and java.util.concurrent.ThreadLocalRandom in multithreaded environments

by Mikhail Vorontsov

Generating pseudorandom data

There are 2 types of random generators in Java: pseudorandom and secure. Pseudorandom generators transform a seed into a new portion of pseudorandom data based on some formula. Secure random generators use machine-specific sources of actually random events (file/socket access, for example) to generate their data.

Secure random generators:

  • should be used only when cryptographically strong random data is required
  • are slow
  • could get stuck waiting for external events if you have requested a lot of random data (Linux /dev/random is an example of such a generator)

Pseudorandom generators, on the other hand, depend only on the initial “seed” value, so you can generate the same sequence of pseudorandom events if you provide the same seed to the algorithm. In the general case such generators are fast because they are CPU bound only (they do not rely on any IO). We will review the evolution of pseudorandom generators in Java and the reasons behind these changes.
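The determinism is easy to demonstrate – two generators created with the same seed produce identical sequences:

import java.util.Random;

final Random r1 = new Random(42);
final Random r2 = new Random(42);
for (int i = 0; i < 5; i++)
    System.out.println(r1.nextInt() == r2.nextInt()); // always true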

java.util.Random

java.util.Random has been available since Java 1.0. It is a thread safe class, so you may (in theory) share instances of this class between several threads without expecting two threads to obtain the same random data at the same time. Such thread safety is achieved by using an AtomicLong for the generator seed.

Random uses AtomicLong CAS (compare-and-set) operations for updating its seed. Despite being a light non-blocking primitive used in a lot of non-blocking algorithms, CAS behaves really poorly under high contention. The test results will show just how poorly it behaves.
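A sketch of such a contention test (the class name is made up; several threads hammer a single shared instance, so every nextLong() call has to CAS the shared seed and retries pile up):

import java.util.Random;

public class SharedRandomContention {
    public static void main(String[] args) {
        final Random shared = new Random(42);
        for (int t = 0; t < 8; t++) {
            new Thread(new Runnable() {
                public void run() {
                    long blackhole = 0;
                    for (int i = 0; i < 10_000_000; i++)
                        blackhole += shared.nextLong(); // contended CAS on the shared seed
                    System.out.println(blackhole); // prevents dead code elimination
                }
            }).start();
        }
    }
}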

java.util.concurrent.ThreadLocalRandom

java.util.concurrent.ThreadLocalRandom was added in Java 7 as an attempt to overcome all the performance issues associated with java.util.Random. This new class extends java.util.Random, so you may pass it to any logic expecting the old java.util.Random.

Here are the main ThreadLocalRandom implementation details:

  • Internally it does not use the AtomicLong seed from Random. Instead it uses an ordinary long.
  • You can not create an instance of ThreadLocalRandom yourself, because its constructor is not public. Instead you should use its static factory ThreadLocalRandom.current(). This factory method queries an internal ThreadLocal<ThreadLocalRandom> field to obtain the per-thread instance.
  • It is CPU-cache aware: it uses 8 long dummy fields for padding, thus pushing anything else out of its 64-byte L1 cache line.

The importance of all these changes will be revealed by the following tests.
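Typical usage (note that you should call current() every time instead of caching the returned instance in a field shared between threads):

import java.util.concurrent.ThreadLocalRandom;

final int dice = ThreadLocalRandom.current().nextInt(1, 7); // uniform in [1, 7)
final long id = ThreadLocalRandom.current().nextLong();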

Continue reading


Core Java 7 Change Log

by Mikhail Vorontsov


01 Jan 2014 update – covered all Java 7 versions up to Java 7u45.

This article will list performance related changes made in the core part of various Java 7 updates. I will track changes in the following packages:

  • java.lang.*
  • java.io
  • java.math
  • java.net
  • java.nio.*
  • java.text.*
  • java.util.*

These changes have one peculiar property – Oracle usually does not include them in the release notes. Probably they are considered too minor… Anyway, this page will try to help you a little: hopefully, you will not miss updates like the “famous” String.substring change in Java 7u6.

This page will be updated after new Java 7 releases. I will also add a list of changes made before Java 7u45 (give me some time!).

Java 7u45 (compared to Java 7u25)

Most of these changes were made in Java 7u40, but unfortunately I missed that release.

Shared empty table in ArrayList and HashMap update

File changed: \util\ArrayList.java
File changed: \util\HashMap.java

The two most popular Java collections – ArrayList and HashMap – now consume less memory if their instances are empty (size=0). The only reason for this update, from my point of view, is a critical mass of cases in Oracle performance benchmarks where a map/list is created but never populated due to conditional logic (for example, an empty parameter list was provided in the HTTP request).

This update reduces the garbage collector workload by creating less unused garbage.

In the case of ArrayList, only objects created via the empty constructor are updated. If you are using a constructor specifying the initial size, nothing changes for you.

/**
 * Shared empty array instance used for empty instances.
 */
private static final Object[] EMPTY_ELEMENTDATA = {};

public ArrayList() {
    super();
    this.elementData = EMPTY_ELEMENTDATA;
}

HashMap got a more thorough update – internal table initialization has been moved out of the constructor, so a map is now always created with an empty table. As a consequence, all getters and setters in the HashMap now check whether the map is empty. This makes getters slightly faster in the case of an empty map – you no longer have to look at the actual table – but at the same time it makes all setters/getters slower in every other case (a very tiny slowdown – a zero check of a single int field – but still an extra instruction to execute).

I hope that this HashMap change has a solid justification – that there are enough always-empty maps in the Oracle benchmark applications. In my personal opinion this change is quite questionable – I do not like “taxes” you always have to pay for a corner case optimization.

/**
 * An empty table instance to share when the table is not inflated.
 */
static final Entry<?,?>[] EMPTY_TABLE = {};

/**
 * The table, resized as necessary. Length MUST Always be a power of two.
 */
transient Entry<K,V>[] table = (Entry<K,V>[]) EMPTY_TABLE;

The author of these changes has responded to me here.

Continue reading


Book review: Systems Performance: Enterprise and the Cloud

All you need to know about Linux performance monitoring and tuning.

If you have visited this blog, you are likely to be interested in Java performance tuning. You want to know why ArrayList is better than LinkedList, why HashMap will usually outperform TreeMap (though we should still use the latter if we want a sorted map), or why date parsing can be so painfully slow.

Nevertheless, at some point in your career you will reach a situation where you have to consider your application environment – server hardware, other applications running on your server and other servers running in your network (as well as many other things).

You may, for example, want to know why disk operations were so quick on your development box but became a major issue on the production box. There could be various reasons:

  • Trying to acquire a file lock on NFS too often
  • Another process is using the same disk – legitimately or due to a misconfiguration
  • The operating system is using the same disk for paging
  • Your development box has an SSD installed, but the production box relies on the “ancient” :) HDD technology
  • Or lots of other reasons

Or you may be on the other side of the spectrum, trying to squeeze the last cycles out of a critical code path. In this situation you may want to know which levels of the memory hierarchy your code is accessing (L1-L3 CPU caches, RAM, disks). Java does not provide you with such information, so you have to use OS monitoring tools to obtain it. This will allow you to modify your algorithm and tune your dataset size so that it fits into the appropriate level of the memory hierarchy.

Or you are probably on the edge of progress and want to deploy your brand new application in the cloud. The biggest issue with clouds is that you have to pay for everything – excessive CPU usage (as well as non-excessive :) ), suboptimal IO, and high memory consumption (usually via a requirement to pay for larger and more expensive instances). Besides that, your application might be affected by the other tenants of the same physical box – for example, an HDD is a non-interruptible device, so one tenant can mount a temporary denial of service “attack” on the other tenants while still staying within its quota. What tools and strategies would you use for performance monitoring/tuning in the cloud?

“Systems Performance: Enterprise and the Cloud” by Brendan Gregg is the best reference book I have seen on Linux and Solaris performance monitoring. It is written for system administrators, so it is not bound to any programming language. The book starts with a description of methodologies which can be used for troubleshooting performance issues. The introductory chapters are followed by chapters dedicated to the following hardware components:

  • CPU
  • Memory
  • File systems
  • Disks
  • Network
  • Cloud computing

Each of these chapters starts with an overview of the given hardware component, followed by a description of possible performance tuning methodologies.

The last chapter of this book describes a real world performance investigation (in my opinion, you should start reading this book from that chapter :) ).

I would recommend ordering a paper version of this book, because it should serve as a handy reference for complex performance issue investigations.



Static code compilation in Groovy 2.0

by Mikhail Vorontsov

A tiny introduction to Groovy

Groovy is a JVM-based language useful for scripting and other non-CPU-critical applications (like the Grails web framework). It is a dynamically typed language allowing you to add methods and properties to existing objects at runtime.

The ability to add methods and properties at runtime is implemented via the methodMissing and propertyMissing methods/handlers as well as dynamic method registration. This feature allows you to support your own DSLs by parsing non-existent method names at runtime and registering the actual method bodies corresponding to such methods/properties. It allows you, for example, to generate database access methods like List<Person> getPersonsByN( N n ), where N is any field defined in the Persons database table.

Such functionality made Groovy popular in web development due to the frameworks’ ability to generate repetitive data access methods at runtime. Unfortunately (or luckily :) ), Groovy method calls use the dynamic dispatch model – the Groovy runtime chooses the best matching method signature based on the runtime argument types instead of the compile time argument types, as Java does. Dynamic dispatch requires each Groovy method call to go through reflection-based runtime method lookup. So, are method calls in Groovy extremely slow? The answer is no – Groovy does a very good job of caching call sites, avoiding another reflection lookup where possible.

Groovy static compilation

One of the main features of Groovy 2.0 is the static compilation mode. It is turned on by annotating methods or the whole class with the @CompileStatic annotation. This annotation actually turns on 2 features:

  1. Static type checking
  2. Static compilation

Continue reading


String.intern in Java 7 and 8 – part 3

by Mikhail Vorontsov

I want to return to the String.intern method I discussed earlier (part 1, part 2). During the last couple of months I have used interning heavily in my pet project in order to see the pros and cons of using String.intern for every non-temporary string in my application (non-temporary meaning one which is expected to live longer than a few seconds and will most likely get into the GC old generation).

As I have written before, the advantages of String.intern in Java 7 and 8 are:

  • Rather fast execution, with nearly no performance loss in multithreaded mode (it still uses the global string pool).
  • Saving memory, thus allowing your dataset to shrink, which will let your application run faster (in general).

The main disadvantages of this method (as I have noticed before) are:

  • The requirement to set the -XX:StringTableSize=N JVM parameter in advance and deal with its fixed value (a JVM restart is needed in order to expand the JVM string pool).
  • The need to add calls to String.intern in a lot of places globally (probably via your own wrapper) – which has a linear code maintenance cost.

After a couple of months of using String.intern in my project, I think its usage should be limited to fields with a limited domain of distinct values (like person first names or state/province names). We should not try to intern anything with a low probability of repeated values – it would be a waste of CPU time.
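A sketch of what this means in practice (Person is a made-up example class):

public class Person {
    private final String state;   // limited domain of distinct values - a good intern candidate
    private final String comment; // free-form text - interning it would waste CPU

    public Person(final String state, final String comment) {
        this.state = state.intern(); // remember to set -XX:StringTableSize=N in advance
        this.comment = comment;
    }
}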

Continue reading