Charset encoding and decoding in Java 7/8

by Mikhail Vorontsov

We will take a look at Charset encoder and decoder performance in Java 7/8. We will check the performance of the following two String methods with various charsets:

/* String to byte[] */
public byte[] getBytes(Charset charset);
/* byte[] to String */
public String(byte bytes[], Charset charset);

I have translated the phrase “Develop with pleasure” into German, Russian, Japanese and Traditional Chinese using Google Translate. We will build chunks of a given size from these phrases by concatenating them with “\n” as a separator until we reach the required length (in most cases the result will be slightly longer). After that we will convert 100 million characters of such data to byte[] and back to String (100M is the total data length in Java chars). We will repeat the conversion 10 times in order to make the results more reliable (as a result, the times in the following table are the times to convert 1 billion chars).

We will use 2 chunk sizes: 100 chars to test the performance of short string conversion and 100M chars to test the raw conversion performance. You can find the source code at the end of this article. We will compare encoding into a “national” charset (US-ASCII for English; ISO-8859-1 for German; windows-1251 for Russian; Shift_JIS for Japanese; GB18030 for Traditional Chinese) with UTF-8 encoding, which could be used as a universal encoding (at the possible price of a longer binary representation). We will also compare Java 7u51 with Java 8 (original release). All tests were run on my Xeon-2650 (2.8 GHz) workstation with the -Xmx32G setting (to avoid GC).
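For reference, the measurement loop looks roughly like this (a minimal sketch with class and method names of my own choosing, not the actual test code – that is linked in the “Source code” section below):

import java.nio.charset.Charset;

public class EncodingBenchSketch
{
    private static final long TOTAL_CHARS = 100_000_000L;

    /* concatenate the phrase with '\n' separators until we reach the required length */
    private static String buildChunk( final String phrase, final int minLength )
    {
        final StringBuilder sb = new StringBuilder( minLength + phrase.length() + 1 );
        while ( sb.length() < minLength )
            sb.append( phrase ).append( '\n' );
        return sb.toString();
    }

    private static void measure( final String phrase, final int chunkSize, final Charset charset )
    {
        final String chunk = buildChunk( phrase, chunkSize );
        final long iterations = TOTAL_CHARS / chunk.length();

        long start = System.nanoTime();
        byte[] bytes = null;
        for ( long i = 0; i < iterations; i++ )
            bytes = chunk.getBytes( charset );   /* String -> byte[] */
        System.out.printf( "%s getBytes: %.3f sec%n", charset, ( System.nanoTime() - start ) / 1e9 );

        start = System.nanoTime();
        String decoded = null;
        for ( long i = 0; i < iterations; i++ )
            decoded = new String( bytes, charset );   /* byte[] -> String */
        System.out.printf( "%s new String: %.3f sec (%d chars)%n",
            charset, ( System.nanoTime() - start ) / 1e9, decoded.length() );
    }

    public static void main( final String[] args )
    {
        measure( "Develop with pleasure", 100, Charset.forName( "US-ASCII" ) );
    }
}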

Here are the test results. Each cell contains two times: Java7_time (Java8_time). Each “UTF-8” row following a “national” charset row contains the conversion times for the same data as the previous row (for example, the last row contains the times to encode/decode the Traditional Chinese string into/from UTF-8).

Charset      | getBytes, ~100 char chunks | new String, ~100 char chunks | getBytes, ~100M chars   | new String, ~100M chars
US-ASCII     | 2.451 sec (2.686 sec)      | 0.981 sec (0.971 sec)        | 2.402 sec (2.421 sec)   | 0.889 sec (0.903 sec)
UTF-8        | 1.193 sec (1.259 sec)      | 0.974 sec (1.181 sec)        | 1.226 sec (1.245 sec)   | 0.887 sec (1.09 sec)
ISO-8859-1   | 2.42 sec (0.334 sec)       | 0.816 sec (0.84 sec)         | 2.441 sec (0.355 sec)   | 0.761 sec (0.801 sec)
UTF-8        | 3.14 sec (3.534 sec)       | 3.373 sec (4.134 sec)        | 3.288 sec (3.498 sec)   | 3.314 sec (4.185 sec)
windows-1251 | 5.85 sec (5.826 sec)       | 2.004 sec (1.909 sec)        | 5.881 sec (5.747 sec)   | 1.902 sec (1.87 sec)
UTF-8        | 5.425 sec (5.256 sec)      | 11.561 sec (12.326 sec)      | 5.544 sec (4.921 sec)   | 11.29 sec (12.314 sec)
Shift_JIS    | 17.343 sec (9.355 sec)     | 24.85 sec (8.464 sec)        | 16.95 sec (9.24 sec)    | 24.6 sec (8.503 sec)
UTF-8        | 9.398 sec (13.201 sec)     | 12.007 sec (16.661 sec)      | 9.681 sec (11.801 sec)  | 12.035 sec (16.602 sec)
GB18030      | 18.754 sec (16.641 sec)    | 15.877 sec (16.267 sec)      | 18.494 sec (16.342 sec) | 16.034 sec (16.406 sec)
UTF-8        | 9.374 sec (11.829 sec)     | 12.092 sec (16.672 sec)      | 9.678 sec (12.991 sec)  | 12.25 sec (16.745 sec)

Test results

We can notice the following facts:

  • Chunked (~100 char) conversion adds nearly no CPU overhead compared to the single huge chunk – though the chunked results will get worse if you allocate less memory for this test.
  • Converting byte[] to String is very fast for single-byte charsets (US-ASCII, ISO-8859-1, windows-1251): the input data size tells you the result size, so you can allocate a char[] of the correct size at once, and, if you are in the java.lang package, you can use one of the package-private String constructors which do not copy the char[].
  • At the same time, byte[] -> String conversion with UTF-8 does not work that quickly for non-ASCII data – besides the more complicated mapping, it has to allocate the maximal possible char[] and then copy the actually used part of it into the resulting String (see the sketch after this list). UTF-8 conversion gets really slow for Chinese/Japanese data.
  • String -> byte[] conversion is generally slower than byte[] -> String conversion for the “national” charsets. Surprisingly, the opposite holds for UTF-8: String -> byte[] is generally faster than byte[] -> String.
  • Shift_JIS and ISO-8859-1 conversions (and probably some others) were seriously optimized in Java 8: Japanese conversion now runs 2-3 times faster than in Java 7 in both directions. For ISO-8859-1, only String -> byte[] was optimized – but it now runs seven times faster. This result looked so surprising that I investigated it separately (see below).
  • One more noticeable difference is the byte[] -> String conversion time for windows-1251 compared to UTF-8: windows-1251 decodes about 6 times faster. I am not sure the difference in binary representation alone justifies it: windows-1251 needs 1 byte per char, while UTF-8 needs 2 bytes per Russian character. There is a similar but smaller gap between ISO-8859-1 and UTF-8: only one character in the German phrase does not fit into a single UTF-8 byte, while nearly every character in the Russian phrase (except whitespace) requires 2 UTF-8 bytes.
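To illustrate the UTF-8 decoding point, here is a rough sketch of the extra work a decoder has to do when the output size is not known in advance (conceptual only – the actual JDK code lives in java.lang.StringCoding and is more involved):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

private static String decodeWithWorstCaseBuffer( final byte[] bytes, final Charset charset )
{
    final CharsetDecoder decoder = charset.newDecoder();
    //worst case allocation: every input byte may become a separate char
    final char[] buf = new char[ (int) ( bytes.length * decoder.maxCharsPerByte() ) ];
    final CharBuffer out = CharBuffer.wrap( buf );
    decoder.decode( ByteBuffer.wrap( bytes ), out, true );
    decoder.flush( out );
    //the actually used prefix is then copied once more into the result String
    return new String( buf, 0, out.position() );
}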

Direct String->byte[]->String conversion for ASCII / ISO-8859-1 data

I have tried to investigate the brilliant performance of the ISO-8859-1 encoder in Java 8. The algorithm itself is very simple – ISO-8859-1 exactly matches the first 256 positions (0-255) of the Unicode table, so the encoder looks like:

if ( char <= 255 )
    write it as byte to output
else
    skip input char, write the charset replacement byte

The difference between the Java 7 and Java 8 versions of ISO_8859_1.java is that Java 7 keeps all encoding logic in a single method, while Java 8 adds a helper method which converts the input char[] for as long as it sees no char greater than 255. I think the introduction of this method helps the JIT emit more efficient code.
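That Java 8 shape looks roughly like the following (a sketch – the method name is mine and the real JDK helper differs in details):

/* encodes a run of Latin-1 chars; returns the number of chars written */
private static int encodeLatin1Run( final char[] src, final int sp, final byte[] dst, final int dp, final int len )
{
    int i = 0;
    //tight loop with a cheap exit condition - easy for the JIT to optimize
    while ( i < len && src[ sp + i ] <= '\u00FF' )
    {
        dst[ dp + i ] = (byte) src[ sp + i ];
        i++;
    }
    return i; //the caller deals with the first char > 255, then re-enters this loop
}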

It is possible to beat the JDK encoder for data which is known to be in US-ASCII or ISO-8859-1. You just have to assume that your string contains only characters valid in your encoding and get rid of all the "plumbing":

private static byte[] toAsciiBytes( final String str )
{
    final byte[] res = new byte[ str.length() ];
    for (int i = 0; i < str.length(); i++)
        res[ i ] = (byte) str.charAt( i );
    return res;
}

This method beats the ISO-8859-1 encoder by 20-25% in Java 8 and by 3-3.5 times in Java 7. Nevertheless, it relies on the JIT to eliminate the bounds checks in both the array store and String.charAt.
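If the per-char charAt calls bother you, a bulk-copy variant is possible: String.getChars performs a single System.arraycopy under the hood. Whether it actually wins depends on the JIT, so measure both (the helper name below is mine):

private static byte[] toAsciiBytesViaGetChars( final String str )
{
    final int len = str.length();
    final char[] chars = new char[ len ];
    str.getChars( 0, len, chars, 0 ); //one bulk copy instead of len charAt calls
    final byte[] res = new byte[ len ];
    for ( int i = 0; i < len; i++ )
        res[ i ] = (byte) chars[ i ]; //truncating cast: assumes every char fits into one byte
    return res;
}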

It is not possible to beat the byte[] -> String conversion for this pair of charsets. There are no public String constructors/factories which would use a char[] you provide – they all make a defensive copy (otherwise the immutability of String could not be guaranteed). The closest thing in terms of performance is the deprecated String(byte ascii[], int hibyte, int offset, int count) constructor. With hibyte = 0 it is useful for byte[] -> String decoding if your charset matches the first 256 Unicode code points (US-ASCII, ISO-8859-1). Unfortunately, this constructor copies data from the end to the head of the string, which is not quite CPU-cache friendly:

private static String asciiBytesToString( final byte[] ascii )
{
    //deprecated constructor allowing data to be copied directly into the String char[]. So convenient...
    return new String( ascii, 0 );
}

On the other hand, String(byte bytes[], int offset, int length, Charset charset) cuts every corner it can: for US-ASCII and ISO-8859-1 it allocates a char[] of the required size, makes a cheap conversion (widening bytes to chars) and hands the resulting char[] to a package-private String constructor, which trusts the decoder and skips the defensive copy.
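Conceptually, that decode fast path boils down to the following (a sketch with a name of my own – the real code is package-private, which is exactly what lets it hand the char[] to String without the final copy):

private static char[] latin1BytesToChars( final byte[] bytes )
{
    final char[] res = new char[ bytes.length ]; //exact result size is known up front
    for ( int i = 0; i < bytes.length; i++ )
        res[ i ] = (char) ( bytes[ i ] & 0xFF ); //cheap widening, no remapping needed
    return res; //outside java.lang we would still have to copy this char[] into a String
}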

Summary

  • Always prefer national charsets like windows-1251 or Shift_JIS to UTF-8: they produce a more compact binary representation (as a rule) and they are faster to encode/decode (there are some exceptions in Java 7, but it is becoming the rule in Java 8).
  • ISO-8859-1 always works faster than US-ASCII in Java 7 and 8. Choose ISO-8859-1 if you don't have a solid reason to use US-ASCII.
  • You can write a very fast String -> byte[] conversion for US-ASCII/ISO-8859-1, but you cannot beat the Java decoders – they have direct access to the output String they create.

Source code

EncodingTests

