
String packing part 1: converting characters to bytes

by Mikhail Vorontsov

31 December 2014 update: minor cleanup at the beginning of this article – we are getting further away from Java 6.

From time to time you may need to keep a huge number of strings in memory. Each String contains a character array (the actual characters of that String) and several int fields, whose number depends on the Java version:

  • Prior to Java 7 update 6: 3 int fields (hash code, offset of the string’s first character in the array, and length of the string)
  • Java 7 from update 6 (but before Java 8): 2 int fields (2 hash codes: normal and murmur)
  • From Java 8: 1 int field (hash code)

Let’s see if we can optimize a theoretical string object in terms of memory. It is easy to notice that offset and length are redundant when a string was not created by the String.substring method (the internal char array is not shared with any other String, which is the default behavior since Java 7 update 6). The hash code can also be recalculated on each hashCode call instead of being cached. Only the char array seems to be required in the general case.

Nevertheless, we can avoid using a char array too. Many programs process all of their data (or at least most of it) in a single encoding. We can convert characters to bytes in that encoding on the fly, so we can keep an array of bytes instead of an array of chars. For the more general case, UTF-8 may be considered a safe encoding. Could we do any better? To answer this question, we need to review the memory layout of Java object fields.
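To make the idea concrete, here is a minimal sketch of a byte-backed string holder. The class name PackedString and its methods are hypothetical and exist only for illustration; the sketch assumes UTF-8 as the single encoding and recalculates the hash code on every call instead of caching it. Note that its hash code is not the same value as String.hashCode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration: keeps a string as encoded bytes instead of a char[]. */
final class PackedString {
    // Assumption: the whole application works with one encoding; UTF-8 is the safe general choice.
    private static final Charset ENCODING = StandardCharsets.UTF_8;

    // The only field: no offset, no length, no cached hash code.
    private final byte[] bytes;

    private PackedString(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Converts the characters of s to bytes in the chosen encoding. */
    static PackedString of(String s) {
        return new PackedString(s.getBytes(ENCODING));
    }

    /** Number of bytes actually stored for this string. */
    int encodedLength() {
        return bytes.length;
    }

    /** Decodes the bytes back into a regular String on demand. */
    @Override
    public String toString() {
        return new String(bytes, ENCODING);
    }

    /** Recalculated on each call; differs from the value String.hashCode would return. */
    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof PackedString && Arrays.equals(bytes, ((PackedString) o).bytes);
    }
}
```

With this sketch, PackedString.of("hello") stores 5 bytes where a char array would need 10. The trade-off is that every toString call allocates a new String.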

32 and 64 bit modes

By default, a 64 bit JVM uses 32 bit (compressed) object references. It switches to 64 bit object references in either of 2 cases:

  • The JVM is started with a heap over 32G
  • Compressed references are turned off with -XX:-UseCompressedOops (you should never do this on production systems)
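If you want to check which mode a running JVM is in, one possible approach on HotSpot is to read the UseCompressedOops option through the HotSpotDiagnosticMXBean. This is only a sketch: the class name OopsMode is hypothetical, the MXBean is HotSpot-specific, and on other JVMs the method below simply reports "unknown".

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: reports whether compressed object references are enabled. */
final class OopsMode {
    /** Returns "true" or "false" as reported by HotSpot, or "unknown" if the option is unavailable. */
    static String compressedOops() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "unknown"; // no such MXBean on this JVM
            }
            return bean.getVMOption("UseCompressedOops").getValue();
        } catch (RuntimeException e) {
            // Assumption: a JVM without this option (for example a 32 bit one) throws here.
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("UseCompressedOops = " + compressedOops());
    }
}
```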

There are just 2 differences between these modes in terms of memory layout:

  1. Object references occupy 4 bytes in 32 bit mode and 8 bytes in 64 bit mode (object references include references to arrays).
  2. The Java object header occupies 12 bytes in 32 bit mode and 16 bytes in 64 bit mode (due to the embedded Class reference).

Java object memory layout

All Java objects start with 8 bytes of service information, such as the identity hash code (returned by the System.identityHashCode method). The header then holds a reference to the object’s Class (this reference occupies 4 bytes on heaps under 32G and 8 bytes on heaps over 32G). Arrays have 4 more bytes (one int field) containing the array length. The header is followed by all declared fields. All objects are aligned on an 8 byte boundary. All primitive fields must be aligned by their size (for example, chars must be aligned on a 2 byte boundary). An object reference (including a reference to an array) occupies 4 bytes in 32 bit mode.

What does it mean for us? To make the most of the available memory, the fields of an object should occupy N*8+4 bytes in total (4, 12, 20, 28 and so on) in 32 bit mode (the common case we will discuss in this article) or N*8 bytes in 64 bit mode. In that case no memory is lost to alignment padding and 100% of it contains useful data.
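The arithmetic behind this rule can be sketched in a few lines. The LayoutMath class below is a hypothetical helper, not a measurement tool: it takes the 12 and 16 byte header sizes and the 8 byte alignment from the text above, and it assumes that fields are packed without internal gaps, so that padding only appears at the end of the object.

```java
/** Hypothetical helper: estimates shallow object size from the layout rules described above. */
final class LayoutMath {
    static final int ALIGNMENT = 8;   // all objects are aligned on an 8 byte boundary
    static final int HEADER_32 = 12;  // header size with 32 bit references
    static final int HEADER_64 = 16;  // header size with 64 bit references

    /** Rounds size up to the next multiple of ALIGNMENT. */
    static long alignUp(long size) {
        return (size + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    /** Total bytes occupied by an object with the given header and field sizes. */
    static long objectSize(int headerBytes, int fieldBytes) {
        return alignUp((long) headerBytes + fieldBytes);
    }

    /** Bytes lost to alignment; zero when fields occupy N*8+4 (32 bit) or N*8 (64 bit) bytes. */
    static long padding(int headerBytes, int fieldBytes) {
        return objectSize(headerBytes, fieldBytes) - headerBytes - fieldBytes;
    }
}
```

Under these assumptions, a Java 8 String in 32 bit mode has 8 bytes of fields (one 4 byte array reference plus one int), so LayoutMath.objectSize(LayoutMath.HEADER_32, 8) gives 24 bytes, 4 of which are padding. With 12 bytes of fields the object still takes 24 bytes, but none of them is wasted.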
