String packing part 1: converting characters to bytes

by Mikhail Vorontsov

31 December 2014 update: minor cleanup at the beginning of this article – we are getting further away from Java 6.

From time to time you may need to keep a huge number of strings in memory. Each String contains a character array – actual characters of that String and

Prior to Java 7 update 6 3 int fields – hash code, offset of that string first character in the array and length of the string
Java 7 from update 6 (but before Java 8) 2 int fields – 2 hash codes (normal and murmur)
From Java 8 1 int field – hash code

Let’s see if we can optimize a theoretical string object in terms of memory. It is easy to notice that offset and length are redundant in case when a string was not created in String.substring method (internal char array is not shared with any other String – this is a default behavior since Java 7 update 6). Hash code can also be calculated on each hashCode call instead of caching. Only char array seems to be required in the general case.

Nevertheless, we can avoid using char array. A lot of programs process all its data in only one encoding at a time (or at least, most of data in one encoding). We can convert characters to bytes on the fly into that encoding, so we may keep array of bytes instead of array of chars. For a more general case, UTF-8 may be considered as a safe encoding. Could we do any better? In order to answer this question, we need to review Java object fields memory layout.

32 and 64 bit modes

JVM runs by default using 32 bit object references. JVM will switch to 64 bit object references in one of 2 cases:

  • A JVM is started with over 32G heap
  • -XX:-UseCompressedOops is turned off (you should never do this for production systems)

There are just 2 differences between these modes in terms of memory layout:

  1. Object references occupy 4 bytes in 32 bit mode and 8 bytes in 64 bit mode (object references include references to arrays).
  2. Java object header occupy 12 bytes in 32 bit mode and 16 bytes in 64 bit mode (due to embedded Class reference)

Java object memory layout

All Java objects start with 8 bytes containing service information like object class and its identity hash code (returned by System.identityHashCode method). Arrays have 4 more bytes (one int field) containing array length. Object header is ended with a reference to an object Class (this reference occupy 4 bytes on under 32G heaps and 8 bytes on over 32G heaps). These fields are followed by all declared fields. All objects are aligned by 8 bytes boundary. All primitive fields must be aligned by their size (for example, chars should be aligned by 2 bytes boundary). Object reference (including any arrays) occupy 4 bytes. What does it mean for us? In order to get most use of available memory, all object fields must occupy N*8+4 bytes (4, 12, 20, 28 and so on) in 32 bit mode (a common case we will discuss in this article) or N*8 bytes in 64 bit mode. In this case 100% memory will contain useful data.

Implementing memory efficient strings

The following logic applies to 32 bit JVM mode (with 12 bytes object headers). You can read some ideas about 64 bit mode here.

Now let’s return to our scenario. In case of keeping a huge number of strings, length of at least 90% of them is generally limited by some rather small number (less than 100, for example). Let us create a number of classes. First one will support all strings with length less or equal to 12. Second one is for strings with length between 13 and 20 (both inclusive), third for lengths from 21 to 28 and so on (as many as it is necessary for your data). Each class will contain N * 2 + 1 int fields (3 ints for lengths up to 12, 5 ints for lengths between 13 and 20 and so on). First field will contain four first string bytes (we have already converted it to byte array), next field – bytes between 5 and 8 and so on. The last one or two ints will contain up to last 8 bytes or zeroes for bytes after last string byte. This, of course, adds a limitation that char=0 is not allowed in our data.

If an input string is too long or can’t be converted to selected encoding or contains (char)0, we may roll back to using the input String. Mixing all these artificial strings and real Strings is easy – treat all of them as Objects, but call their toString methods to get original String (of course, you’ll have to write it, as well as equals and hashCode if you want to use your strings as map keys).

What are our savings in case of 28 characters long string? Normal String will occupy 32 bytes + size of char array. It occupies 12 header bytes plus length * 2 bytes for array data plus padding. In case of 28 char long strings it will be 12 + 28*2 = 68 bytes padded to 72 bytes. So deep size of such String is 72 + 32 = 104 bytes.
Our artificial string will occupy 8(header) + 4(class) + 4*7(7 int fields) = 40 bytes. For 12 char long strings deep size for String is 32 + (12 + 12 * 2 = 36 padded to 40 bytes) = 72 bytes. Our artificial string deep size will be 8 + 4 + 4*3 = 24 bytes.

You can read more about Java objects memory layout, for example at http://www.codeinstructions.com/2008/12/java-objects-memory-structure.html.

Java 6 update 23 has added a runtime option called -XX:+UseCompressedStrings, which adds another java.lang.String implementation, which stores string characters in a byte array if possible. Oracle does not specify which encoding they use for this conversion, so it will be safe to suppose US-ASCII. If your strings contain national alphabet characters, Java will roll back to using normal char[]-based strings. All byte and char array based strings have absolutely the same interface and look absolutely indistinguishably to calling code. Performance penalty for using this option is rather small, so I will recommend to use it whenever possible if you are still using Java 6.

Adding this option to Java startup parameters will leave not so much of this chapter optimization. Byte-array based string will require 24 + length bytes compared to our artificial string 4 + length bytes. Savings will be still visible on short strings. The only truly useful case for described optimization is conversion to an encoding not supported by default JDK conversion.

Here is a possible implementation of such string for the first case (up to 12 characters):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
private static final Charset US_ASCII = Charset.forName( "US-ASCII" );
 
public static Object convert( final String str )
{
    //discard empty or too long strings as well as strings with '\0'
    if ( str == null || str.length() == 0 || str.length() > 12 || str.indexOf( '\0') != -1 )
        return str;
    //encoder may be stored in ThreadLocal
    final CharsetEncoder enc = US_ASCII.newEncoder();
    final CharBuffer charBuffer = CharBuffer.wrap( str );
    try {
        final ByteBuffer byteBuffer = enc.encode( charBuffer );
        final byte[] byteArray = byteBuffer.array();
        if ( byteArray.length <= 12 )
            return new Packed12( byteArray );
        //add cases for longer strings here
        else
            return str;
    } catch (CharacterCodingException e) {
        //there are some chars not fitting to our encoding
        return str;
    }
}
 
private static abstract class PackedBase
{
    protected int get( final byte[] ar, final int index )
    {
        return index < ar.length ? ar[ index ] : 0;
    }
 
    protected abstract ByteBuffer toByteBuffer();
 
    protected String toString( final ByteBuffer bbuf )
    {
        final byte[] ar = bbuf.array();
        //skip zero bytes at the tail of the string
        int last = ar.length - 1;
        while ( last > 0 && ar[ last ] == 0 )
            --last;
        return new String( ar, 0, last + 1, US_ASCII );
    }
 
    public String toString()
    {
        return toString( toByteBuffer() );
    }
}
 
private static class Packed12 extends PackedBase
{
    private final int f1;
    private final int f2;
    private final int f3;
 
    public Packed12( final byte[] ar )
    { //should be the same logic as in java.util.Bits.getInt, because ByteBuffer.putInt use it
        f1 = get( ar, 3 ) | get( ar, 2 ) << 8 | get( ar, 1 ) << 16 | get( ar, 0 ) << 24;
        f2 = get( ar, 7 ) | get( ar, 6 ) << 8 | get( ar, 5 ) << 16 | get( ar, 4 ) << 24;
        f3 = get( ar, 11 ) | get( ar, 10 ) << 8 | get( ar, 9 ) << 16 | get( ar, 8 ) << 24;
    }
 
    protected ByteBuffer toByteBuffer()
    {
        final ByteBuffer bbuf = ByteBuffer.allocate( 12 );
        bbuf.putInt( f1 );
        bbuf.putInt( f2 );
        bbuf.putInt( f3 );
        return bbuf;
    }
 
    @Override
    public boolean equals(Object o) {
        if ( this == o ) return true;
        if ( o == null || getClass() != o.getClass() ) return false;
        Packed12 packed12 = ( Packed12 ) o;
        return f1 == packed12.f1 && f2 == packed12.f2 && f3 == packed12.f3;
    }
 
    @Override
    public int hashCode() {
        int result = f1;
        result = 31 * result + f2;
        result = 31 * result + f3;
        return result;
    }
}
private static final Charset US_ASCII = Charset.forName( "US-ASCII" );

public static Object convert( final String str )
{
    //discard empty or too long strings as well as strings with '\0'
    if ( str == null || str.length() == 0 || str.length() > 12 || str.indexOf( '\0') != -1 )
        return str;
    //encoder may be stored in ThreadLocal
    final CharsetEncoder enc = US_ASCII.newEncoder();
    final CharBuffer charBuffer = CharBuffer.wrap( str );
    try {
        final ByteBuffer byteBuffer = enc.encode( charBuffer );
        final byte[] byteArray = byteBuffer.array();
        if ( byteArray.length <= 12 )
            return new Packed12( byteArray );
        //add cases for longer strings here
        else
            return str;
    } catch (CharacterCodingException e) {
        //there are some chars not fitting to our encoding
        return str;
    }
}

private static abstract class PackedBase
{
    protected int get( final byte[] ar, final int index )
    {
        return index < ar.length ? ar[ index ] : 0;
    }

    protected abstract ByteBuffer toByteBuffer();

    protected String toString( final ByteBuffer bbuf )
    {
        final byte[] ar = bbuf.array();
        //skip zero bytes at the tail of the string
        int last = ar.length - 1;
        while ( last > 0 && ar[ last ] == 0 )
            --last;
        return new String( ar, 0, last + 1, US_ASCII );
    }

    public String toString()
    {
        return toString( toByteBuffer() );
    }
}

private static class Packed12 extends PackedBase
{
    private final int f1;
    private final int f2;
    private final int f3;

    public Packed12( final byte[] ar )
    { //should be the same logic as in java.util.Bits.getInt, because ByteBuffer.putInt use it
        f1 = get( ar, 3 ) | get( ar, 2 ) << 8 | get( ar, 1 ) << 16 | get( ar, 0 ) << 24;
        f2 = get( ar, 7 ) | get( ar, 6 ) << 8 | get( ar, 5 ) << 16 | get( ar, 4 ) << 24;
        f3 = get( ar, 11 ) | get( ar, 10 ) << 8 | get( ar, 9 ) << 16 | get( ar, 8 ) << 24;
    }

    protected ByteBuffer toByteBuffer()
    {
        final ByteBuffer bbuf = ByteBuffer.allocate( 12 );
        bbuf.putInt( f1 );
        bbuf.putInt( f2 );
        bbuf.putInt( f3 );
        return bbuf;
    }

    @Override
    public boolean equals(Object o) {
        if ( this == o ) return true;
        if ( o == null || getClass() != o.getClass() ) return false;
        Packed12 packed12 = ( Packed12 ) o;
        return f1 == packed12.f1 && f2 == packed12.f2 && f3 == packed12.f3;
    }

    @Override
    public int hashCode() {
        int result = f1;
        result = 31 * result + f2;
        result = 31 * result + f3;
        return result;
    }
}

Using this code, I've measured memory consumption of the following String array with and without -XX:+UseCompressedStrings option, as well as memory consumption of the similar array of packed strings.

1
2
3
4
5
6
7
8
9
private static final int SIZE = 10_000_000;
 
String[] strings = new String[ SIZE ];
for ( int i = 0; i < SIZE; ++i )
    strings[ i ] = "Aa" + i;
 
Object[] packed = new Object[ SIZE ];
for ( int i = 0; i < SIZE; ++i )
    packed[ i ] = convert( "Aa" + i );
private static final int SIZE = 10_000_000;

String[] strings = new String[ SIZE ];
for ( int i = 0; i < SIZE; ++i )
    strings[ i ] = "Aa" + i;

Object[] packed = new Object[ SIZE ];
for ( int i = 0; i < SIZE; ++i )
    packed[ i ] = convert( "Aa" + i );
String, no compression String, -XX:+UseCompressedStrings packed strings
722.48 Mb 645.47 Mb 268.46 Mb

As you can see, even compared to built-in JDK string compression, our algorithm performs extremely well on short strings due to no memory wasted on internal array management.

This test was run on Java 6 update 25. Java 7 does not support UseCompressedStrings option any more. Initial release of Java 7 silently ignored it, Java 7 update 2 wrote a warning to console:

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option UseCompressedStrings; support was removed in 7.0

Summary

  • java.lang.String objects were designed to be fast and flexible. That's why they can share internal char[] with other strings. They also cache the calculated hash code value, because strings are often used as HashMap keys or HashSet values. But these properties add a great penalty for short strings memory footprint. We can avoid this penalty by implementing our own string replacement objects.
  • Oracle developers tried to solve the same problem in late Java 6 releases by introducing -XX:+UseCompressedStrings option. Unfortunately, it is not supported anymore in Java 7, maybe due to not so big memory savings as one may expect from it.

Source code

Source code is available here.


4 thoughts on “String packing part 1: converting characters to bytes

  1. Pingback: Use case: Optimizing memory footprint of read only csv file (Trove, Unsafe, ByteBuffer, data compression)

  2. Pingback: An overview of memory saving techniques in Java  - Java Performance Tuning Guide

Leave a Reply

Your email address will not be published. Required fields are marked *