31 December 2014 update: minor cleanup at the beginning of this article – we are getting further away from Java 6.
From time to time you may need to keep a huge number of strings in memory. Each
String contains a character array holding the actual characters of that string, plus several int fields whose number depends on the Java version:
|Java version|Number of int fields in String|
|Prior to Java 7 update 6|3|
|Java 7 from update 6 (but before Java 8)|2|
|From Java 8|1|
Let’s see if we can optimize a theoretical string object in terms of memory. The offset and length fields are clearly redundant when a string was not created by the
String.substring method (the internal char array is not shared with any other
String; this is the default behavior since Java 7 update 6). The hash code can also be calculated on each
hashCode call instead of being cached. Only the char array seems to be required in the general case.
Nevertheless, we can avoid using a char array as well. Many programs process all their data in only one encoding at a time (or at least most of their data in one encoding). We can convert characters to bytes on the fly using that encoding, so we may keep an array of bytes instead of an array of chars. For a more general case, UTF-8 may be considered a safe encoding. Could we do any better? In order to answer this question, we need to review the memory layout of Java object fields.
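The idea of keeping encoded bytes instead of chars can be sketched as follows. This is only an illustration under stated assumptions: the class name ByteBackedText is hypothetical, and UTF-8 is assumed to be the single encoding used by the program.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical holder that keeps text as UTF-8 bytes instead of a char[].
final class ByteBackedText {
    private final byte[] utf8;

    ByteBackedText(String s) {
        // Encode once on construction; ASCII characters take 1 byte instead of 2.
        this.utf8 = s.getBytes(StandardCharsets.UTF_8);
    }

    // Number of bytes actually stored for the text.
    int byteLength() {
        return utf8.length;
    }

    // Decode back to a regular String only when one is needed.
    @Override
    public String toString() {
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

For the text "hello" this keeps 5 bytes of character data, whereas a char array would need 10. Non-ASCII characters take 2 or more bytes in UTF-8, so the saving depends on the data.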
32 and 64 bit modes
By default, a 64 bit JVM uses 32 bit (compressed) object references. It will switch to 64 bit object references in one of 2 cases:
- The JVM is started with an over 32G heap
- Compressed references are turned off with -XX:-UseCompressedOops (you should never do this for production systems)
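If you want to check which mode a running JVM is in, one possible approach on HotSpot-based JVMs is to read the UseCompressedOops flag through the diagnostic MXBean. The sketch below assumes a HotSpot JVM; the class name ReferenceMode is hypothetical, and the method falls back to "unknown" where the flag or the bean is not available.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper that reports whether compressed (32 bit) references are in use.
final class ReferenceMode {

    // Returns "true" or "false" as reported by the JVM,
    // or "unknown" when the flag cannot be read (for example, on a non-HotSpot JVM).
    static String useCompressedOops() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return bean.getVMOption("UseCompressedOops").getValue();
        } catch (RuntimeException e) {
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("UseCompressedOops = " + useCompressedOops());
    }
}
```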
There are just 2 differences between these modes in terms of memory layout:
- Object references occupy 4 bytes in 32 bit mode and 8 bytes in 64 bit mode (object references include references to arrays).
- Java object headers occupy 12 bytes in 32 bit mode and 16 bytes in 64 bit mode (due to the embedded Class reference).
Java object memory layout
All Java objects start with 8 bytes containing service information such as the object's identity hash code (returned by the
System.identityHashCode method). The object header ends with a reference to the object's
Class (this reference occupies 4 bytes on under 32G heaps and 8 bytes on over 32G heaps). Arrays have 4 more bytes (one
int field) containing the array length. These parts are followed by all declared fields. All objects are aligned on an 8 byte boundary. All primitive fields must be aligned by their size (for example,
chars should be aligned on a 2 byte boundary).
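To make the layout rules concrete, here is a rough estimate of an array's memory footprint in 32 bit mode. It is a simplified model built only from the numbers above (8 service bytes, a 4 byte Class reference, a 4 byte length, 8 byte object alignment), not a measurement; the class name ArrayFootprint is hypothetical.

```java
// Hypothetical estimator for array sizes in 32 bit (compressed references) mode.
final class ArrayFootprint {
    // 8 bytes of service information + 4 byte Class reference + 4 byte array length.
    static final int ARRAY_HEADER_BYTES = 16;
    // Every object is aligned on an 8 byte boundary.
    static final int ALIGNMENT = 8;

    // Estimated size in bytes of an array with the given length and element size.
    static long sizeOf(long length, int elementBytes) {
        long raw = ARRAY_HEADER_BYTES + length * elementBytes;
        // Round up to the next multiple of the alignment.
        return (raw + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }
}
```

Under this model, 10 characters cost 40 bytes as a char array (sizeOf(10, 2)) and 32 bytes as a byte array (sizeOf(10, 1)).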
Object references (including references to arrays) occupy 4 bytes in 32 bit mode. What does this mean for us? In order to make the most of available memory, the fields of an object must occupy N*8+4 bytes in total (4, 12, 20, 28 and so on) in 32 bit mode (the common case we will discuss in this article) or N*8 bytes in 64 bit mode. In that case the 12 or 16 byte header plus the fields add up to a multiple of 8, so no memory is lost to alignment padding.
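The N*8+4 rule can be checked with a few lines of arithmetic. The sketch below assumes 32 bit mode with a 12 byte header and 8 byte alignment, and it ignores any gaps between fields caused by per-field alignment; the class name ObjectLayout32 is hypothetical.

```java
// Hypothetical calculator for object sizes in 32 bit (compressed references) mode.
final class ObjectLayout32 {
    // 8 bytes of service information + 4 byte Class reference.
    static final int HEADER_BYTES = 12;
    // Every object is aligned on an 8 byte boundary.
    static final int ALIGNMENT = 8;

    // Estimated object size for the given total size of its fields.
    static long shallowSize(long fieldBytes) {
        long raw = HEADER_BYTES + fieldBytes;
        return (raw + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    // Bytes lost to alignment padding at the end of the object.
    static long paddingBytes(long fieldBytes) {
        return shallowSize(fieldBytes) - HEADER_BYTES - fieldBytes;
    }
}
```

With 4, 12 or 20 bytes of fields, paddingBytes returns 0. With 8 bytes of fields the object takes 24 bytes, of which 4 are padding.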