Tag Archives: charset

Charset encoding and decoding in Java 7/8

by Mikhail Vorontsov

We will take a look at Charset encoder and decoder performance in Java 7/8. We will check the performance of the 2 following String methods with various charsets:

1
2
3
4
5
/* String to byte[] */
public byte[] getBytes(Charset charset);
/* byte[] to String */
public String(byte bytes[], Charset charset);
    
/* String to byte[] */
public byte[] getBytes(Charset charset);
/* byte[] to String */
public String(byte bytes[], Charset charset);
    

I have translated phrase “Develop with pleasure” into German, Russian, Japanese and traditional Chinese using Google Translate. We will build chunks of given size from these phrases by concatenating them using “\n” as a separator until we reach the required length (in most cases the result will be slightly longer). After that we will convert 100 million characters of such data to byte[] and back to String (100M is the total data length in Java chars). We will convert data 10 times in order to make results more reliable (as a result, times in the following table are the times to convert 1 billion chars).

We will use 2 chunk sizes: 100 chars to test the performance of short string conversion and 100M chars to test the raw conversion performance. You can find the source code at the end of this article. We will compare encoding into a “national” charset (US-ASCII for English; ISO-8859-1 for German; windows-1251 for Russian; Shift_JIS for Japanese; GB18030 for Traditional Chinese) with UTF-8 encoding, which could be used as a universal encoding (at a possible price of a longer binary representation). We will also compare Java 7u51 with Java 8 (original release). All tests were run on my Xeon-2650 (2.8Ghz) workstation with -Xmx32G setting (to avoid GC).

Here are the test results. Each cell has two times: Java7_time (Java8_time). “UTF-8” line, which follows every “national” charset line contains conversion times for the data from the previous line (for example, the last line contains times to encode/decode a string in the traditional Chinese into UTF-8).

Continue reading