String packing part 2: converting Strings to any other objects

by Mikhail Vorontsov

As we have found out in the previous string packing article, even an empty String will occupy at least 48 bytes. In some cases we may expect that some String variables will contain values which could be easily converted to primitive values. Unfortunately, other values of these variables may not be converted, so you will still need a String.

You may notice this either for single fields of some objects or for a collection of strings, like Map<Integer, String>. It doesn’t matter: the ideas in this chapter apply to both cases.

In order to save memory you need to replace Strings with Objects. You need a conversion method, which is applied to every String you are trying to store and returns either the original String or some other, less memory-hungry Object. When you need your original string back, all you have to do is call the toString() method on the stored Object (probably supporting null values as well). All other code must access your fields via such set/get methods.
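The set/get idea above can be sketched as follows. This is a minimal illustration, not the article's implementation: the class name Trade, the field comment and the tiny pack method are hypothetical stand-ins, and the pack shown here only handles one-character strings.

```java
// Hypothetical holder class: Trade, comment, setComment and getComment
// are illustrative names that do not come from the article.
final class Trade {
    // Declared as Object so it can hold either the original String
    // or a more compact replacement such as Character.
    private Object comment;

    public void setComment(final String value) {
        this.comment = pack(value);
    }

    public String getComment() {
        // toString() restores the original text; null stays null.
        return comment != null ? comment.toString() : null;
    }

    // Deliberately tiny stand-in for a real conversion method:
    // only one-character strings are replaced here.
    static Object pack(final String str) {
        if (str != null && str.length() == 1) {
            return Character.valueOf(str.charAt(0));
        }
        return str;
    }
}
```

Callers only ever see Strings, so the packed representation can change later without touching the rest of the code.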

The easiest types of replacement objects to support are Character for strings with length = 1, Integer for shorter numbers (because it has better caching support than other wrapper types), Long for longer numbers and Double for decimal point numbers.

There are still a few easy-to-forget pitfalls. First of all, all integer conversion methods, like Integer.valueOf, accept strings starting with zeroes, like “01234”. They skip the leading zeroes and convert the rest of the input string (the result is 1234 in this example). Converting 1234 back to a string returns “1234”, of course, so the original value is lost. That’s why we can’t convert strings starting with ‘0’ (except zero itself).

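Next, the Double.toString method switches to scientific notation once a value reaches 10^7 (for example, 12345678.5 becomes “1.23456785E7”), and also for non-zero magnitudes below 10^-3. So we must either restrict which input strings are converted to Double, for example by checking that the value converts back to exactly the original string, or extend our getter method logic.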
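Both pitfalls are easy to reproduce. The sketch below uses a hypothetical class RoundTripPitfalls with two helper methods (the names are not from the article) that report whether a string survives conversion to a wrapper type and back.

```java
// Hypothetical demo class; the helper names are illustrative only.
final class RoundTripPitfalls {
    // True only if converting to Integer and back reproduces the input exactly.
    static boolean survivesIntegerRoundTrip(final String s) {
        try {
            return Integer.valueOf(s).toString().equals(s);
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // True only if converting to Double and back reproduces the input exactly.
    static boolean survivesDoubleRoundTrip(final String s) {
        try {
            return Double.valueOf(s).toString().equals(s);
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(final String[] args) {
        System.out.println(Integer.valueOf("01234"));     // prints 1234
        System.out.println(Double.valueOf("1234567.5"));  // prints 1234567.5
        System.out.println(Double.valueOf("12345678.5")); // prints 1.23456785E7
    }
}
```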

What can we do in the getter? The simplest one looks like this:

public static String unpack( final Object obj )
{
    return obj != null ? obj.toString() : null;
}

But we can, for example, support longer Double values by type-checking the input Object and applying an appropriate java.text.DecimalFormat to Double objects. Or we can avoid the 32 bytes of fixed String memory footprint by storing a char[] as the Object and converting it back to a String in the getter. Or, even better, we can convert the String into a UTF-8 encoded byte[].
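A getter extended along these lines might look like the sketch below. The class name Unpacker and the DecimalFormat pattern are assumptions made for illustration; the pattern is just one plausible choice that always prints at least one fractional digit and never uses an exponent.

```java
import java.nio.charset.StandardCharsets;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical class name; the pattern below is an assumption, not the article's choice.
final class Unpacker {
    static String unpack(final Object obj) {
        if (obj == null) {
            return null;
        }
        if (obj instanceof byte[]) {
            // byte[] is assumed to hold UTF-8 encoded text
            return new String((byte[]) obj, StandardCharsets.UTF_8);
        }
        if (obj instanceof char[]) {
            return new String((char[]) obj);
        }
        if (obj instanceof Double) {
            // DecimalFormat is not thread-safe, so a fresh instance is created per call.
            // Locale.ROOT keeps '.' as the decimal separator on every machine.
            final DecimalFormat fmt = new DecimalFormat(
                    "0.0###############", DecimalFormatSymbols.getInstance(Locale.ROOT));
            return fmt.format(((Double) obj).doubleValue());
        }
        return obj.toString();
    }
}
```

With a getter like this, the packing side has to verify its round trip against the same formatting logic rather than against Double.toString, otherwise the two can disagree.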

We also should not forget about the possibility of creating our own ‘wrapper’ objects for cases which can’t be converted straight to one of the standard types, but are common across the processed data. In the first example, we are dealing with a string which consists of a character followed by a few digits, like ‘C12345’. We may create our own object to store a char and a long. Why long and not int? As was said before, we have either 4 or 12 (or more) bytes to spend. A char consumes 2 bytes, so either we must fit the number into the remaining two bytes, or we can use ten more.

Another example of a special case is a timestamp stored in the following format: yyyyMMddHHmmssSSSSSS. Yes, it has 6 digits after the seconds: these are microseconds. The full length of such a string is 20 characters, which is more than the 18 digits guaranteed to fit into a long. But you may notice that the first 2 digits are ’20’ in most cases (you may also support ’19’ if you wish). This leaves you 18 digits to store in one long (and the year 2100 problem to your descendants), which, of course, must be a field of your custom object. The toString() method of such an object must prepend ’20’ to the stored long value, zero-padded to 18 digits, because years 2000 to 2009 leave a leading zero that the long does not keep.
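A sketch of such a timestamp wrapper is shown below. The class name PackedTimestamp and the method tryPack are hypothetical; only the ’20’ prefix is supported, and the digits are parsed by hand so that no exception is needed for malformed input.

```java
// Hypothetical wrapper for yyyyMMddHHmmssSSSSSS strings that start with "20".
final class PackedTimestamp {
    private static final int TOTAL_LENGTH = 20;
    private static final String CENTURY = "20";

    // The 18 digits that follow the century prefix.
    private final long digits;

    private PackedTimestamp(final long digits) {
        this.digits = digits;
    }

    // Returns a packed object, or null if the string does not have the expected shape.
    static PackedTimestamp tryPack(final String str) {
        if (str == null || str.length() != TOTAL_LENGTH || !str.startsWith(CENTURY)) {
            return null;
        }
        long value = 0;
        for (int i = CENTURY.length(); i < TOTAL_LENGTH; i++) {
            final char c = str.charAt(i);
            if (c < '0' || c > '9') {
                return null;
            }
            value = value * 10 + (c - '0'); // 18 digits always fit into a long
        }
        return new PackedTimestamp(value);
    }

    @Override
    public String toString() {
        final String tail = Long.toString(digits);
        final StringBuilder sb = new StringBuilder(TOTAL_LENGTH).append(CENTURY);
        // Restore leading zeroes lost by the long (years 2000 to 2009).
        for (int i = tail.length(); i < TOTAL_LENGTH - CENTURY.length(); i++) {
            sb.append('0');
        }
        return sb.append(tail).toString();
    }
}
```

This sketch only checks that the characters are digits; it does not validate that the month, day or time fields are in range, which is enough for a lossless round trip.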

Here is a possible implementation of setter method and CharAndNumber class (first custom example):

public static Object pack( final String str )
{
    if ( str == null || str.length() > 18 )
        return str;
    if ( str.isEmpty() )
        return ""; //ensure that interned empty string literal is returned
    final char firstChar = str.charAt( 0 );
    if ( str.length() == 1 ) //one char string is converted to Character
        return firstChar;
    //all wrapper classes valueOf methods support leading zeroes, but conversion back will not restore them
    final boolean startDigit1to9 = firstChar >= '1' && firstChar <= '9';
    if ( startDigit1to9 )
    {
        try
        { //Integers have extended caching support, prefer them
            return Integer.valueOf( str );    
        }
        catch ( NumberFormatException ex )
        {
            try
            { //fallback to Longs if possible
                return Long.valueOf( str );    
            }
            catch ( NumberFormatException ex2 )
            {                       
                try
                {
                    //convert to double and check if we can convert value back to the exactly original string
                    final Double val = Double.valueOf( str );
                    final String reverted = val.toString();
                    if ( reverted.equals( str ) )
                        return val;
                }
                catch ( NumberFormatException ignored )
                {
                }
            }
        }
    }
    //add support for other known data types here. Replacement objects must be as compact as possible.
    //Special type: character followed by several digits.
    final boolean secondDigit1to9 = str.charAt( 1 ) >= '1' && str.charAt( 1 ) <= '9';
    if ( secondDigit1to9 )
    {
        try
        {
            return new CharAndNumber( firstChar, Long.valueOf( str.substring( 1 ) ) );
        }
        catch ( NumberFormatException ignored )
        {
        }
    }
    return str;
}
 
//Compact replacement for strings like "C12345": one char followed by a number
private static class CharAndNumber
{
    public final char m_symbol;
    public final long m_number;

    public CharAndNumber( char symbol, long number ) {
        this.m_symbol = symbol;
        this.m_number = number;
    }

    @Override
    public String toString()
    {
        return Character.toString( m_symbol ) + Long.toString( m_number );
    }
}

Of course, using such methods requires extra CPU cycles, so it is worth using them only if you keep enough such objects in memory (usually more than 100,000) and, preferably, keep them for a noticeable time (compared to the cost of creating such objects).

If you can’t convert your String into a primitive type value, consider converting it to a byte[] in UTF-8 encoding. This is a lossless encoding, so you can be sure you will get your original String back.
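A minimal sketch of this fallback follows. The class name Utf8Packer is hypothetical. One assumption worth stating: the round trip is lossless for well-formed strings, but a String containing an unpaired surrogate cannot be encoded and getBytes substitutes a replacement, so the sketch verifies the round trip and keeps the original String when it fails.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class name; shows the byte[] fallback for non-numeric strings.
final class Utf8Packer {
    static Object packAsBytes(final String str) {
        if (str == null) {
            return null;
        }
        final byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
        // Malformed input (e.g. an unpaired surrogate) is silently replaced by
        // getBytes, so keep the String itself if decoding does not restore it.
        return new String(bytes, StandardCharsets.UTF_8).equals(str) ? bytes : str;
    }

    static String unpack(final Object obj) {
        if (obj == null) {
            return null;
        }
        if (obj instanceof byte[]) {
            return new String((byte[]) obj, StandardCharsets.UTF_8);
        }
        return obj.toString();
    }
}
```

The verification step creates a temporary String, which costs CPU at packing time; whether that is acceptable depends on how often the values are written compared to how long they are kept.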

Important: this processing logic relies heavily on exceptions being thrown for invalid numbers. This approach has a severe performance penalty if your strings start with a digit but contain non-convertible characters later, like “123/456a”. You can read more about it in the “Throwing an exception is very slow” article.
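One way to sidestep the exception cost is to check the characters before parsing. The sketch below is an alternative to the exception-driven pack method, not a copy of it: the class NumberPrecheck and its methods are hypothetical names, and it covers only the Integer and Long cases.

```java
// Hypothetical helper that validates input up front instead of catching exceptions.
final class NumberPrecheck {
    // True if the string is non-empty, contains only ASCII digits
    // and has no leading zero (a lone "0" is allowed).
    static boolean isPlainDecimal(final String str) {
        final int len = str.length();
        if (len == 0 || (len > 1 && str.charAt(0) == '0')) {
            return false;
        }
        for (int i = 0; i < len; i++) {
            final char c = str.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    // Returns Integer or Long for plain decimal strings, otherwise the original String.
    static Object packInteger(final String str) {
        if (str == null || str.length() > 18 || !isPlainDecimal(str)) {
            return str;
        }
        // Cannot throw here: at most 18 ASCII digits always fit into a long.
        final long value = Long.parseLong(str);
        if (value <= Integer.MAX_VALUE) {
            return Integer.valueOf((int) value);
        }
        return Long.valueOf(value);
    }
}
```

A string like “123/456a” is rejected by the character loop after four comparisons, with no exception object or stack trace being created.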

Summary

If you have a big collection of strings, or objects with String fields, in memory and you know that a noticeable share of these strings (at least 10%, for example) can actually be converted to primitive type values, you may replace your String fields with Object fields and use the provided pack/unpack methods to convert Strings to Objects and back, thus saving memory.

If you can’t convert a string to a primitive, consider converting it into a byte[] in UTF-8 encoding. This is a lossless conversion, so you can always convert your byte[] back into the original string.

