Regexp-related methods of String

by Mikhail Vorontsov

java.lang.String has several methods, which are actually shortcuts for various java.util.Pattern/Matcher methods. These are

  • matches(String)
  • replaceAll(String, String), replaceFirst(String, String)
  • split(String), split(String, int)

These methods have 2 performance drawbacks:

  • they create and compile Pattern objects internally for each call
  • they may be used for simple cases, when manual parsing is absolutely appropriate

For example, matches(String) method is implemented as Pattern.matches(regex, this), which tries to match current string (this) against the given regex. Pattern.matches is defined as Pattern.compile(regex).matcher(input).matches()

All other String regex methods look similar: compile a temporary Pattern, create a Matcher for current String and call required method on the Matcher object. This is appropriate if you are using these String methods just once or rarely, but it is better to avoid them if you are calling regex methods for each piece of data you are processing.

As an example, let’s experiment with the pattern "^(([0-1][0-9])|(2[0-3]))\\s+([0-5][0-9])$". This regex checks if we have two digits hours followed by at least one space, which is followed by two digits minutes. For example, string “14 00” will match to this regex, but “14:00” wouldn’t match.

10 million calls were made as a test:

  1. “14 00”.matches( “^(([0-1][0-9])|(2[0-3]))\\s+([0-5][0-9])$” )
  2. Pattern.compile( “^(([0-1][0-9])|(2[0-3]))\\s+([0-5][0-9])$” )
  3. Precompiled this pattern( Pattern.compile called once outside loop ) and called pattern.matcher( “14 00 “).matches() in a loop

Here are test times:

  1. String.matches: 12.34 sec
  2. Pattern.compile: 10.82 sec
  3. Matcher.matches: 1.44 sec

Java 6 and 7 have roughly the same performance on these tests. Of course, these times depend on regex complexity and input data length, but anyway, precompiling a Pattern is always a good idea. At least, you wouldn’t lose time, even if you will call regex method only once. But if you are going to use a regexp multiple times, expect 5-10 times speedup compared to the case when you compile a Pattern on each matching.

Let’s see how to replace String methods with compiled Pattern and Matcher methods. For example, you have a call:

1
2
input.matches( "^(([0-1][0-9])|(2[0-3]))\\s+([0-5][0-9])$" )
        
input.matches( "^(([0-1][0-9])|(2[0-3]))\\s+([0-5][0-9])$" )
        

First of all, you need to extract a Pattern. Define a static final field:

1
2
private static final Pattern TIME_PATTERN = Pattern.compile( "^(([0-1][0-9])|(2[0-3]))\\s+([0-5][0-9])$" );
        
private static final Pattern TIME_PATTERN = Pattern.compile( "^(([0-1][0-9])|(2[0-3]))\\s+([0-5][0-9])$" );
        

Now you can rewrite your original statement:

1
2
TIME_PATTERN.matcher( input ).matches()
        
TIME_PATTERN.matcher( input ).matches()
        

The same logic is applicable to 3 other regex methods:

Before: After
input.replaceAll("\\s+", ";") private static final Pattern SPACES_PATTERN = Pattern.compile( "\\s+" );
SPACES_PATTERN.matcher( input ).replaceAll( ";" );
replaceFirst Same logic as in replaceAll example
input.split("\\s+") private static final Pattern SPACES_PATTERN = Pattern.compile( "\\s+" );
SPACES_PATTERN.split( input );

Optimizing simple cases

Patterns are a good solution when you need to match data against complex pattern. They allow you to express parsing logic in a simple form, thus reducing code complexity. Always use patterns for such cases – do not reinvent a wheel!

But there are a lot of simple cases where most developers still tend to use patterns, for example, retrieving parts of “key=value” string or splitting a string which parts are space-separated. You’d better still use patterns for such cases in not performance critical parts of your code, but it worth considering processing simple cases ‘manually’ in critical modules.

String.split method was optimized in Java 7. If pattern is one character long and that character is not a regex-special one, when a special code branch would be executed, which is calling indexOf(int, int) in a loop. In most cases it doesn’t worth optimizing String.split(one-character String) further. Advice to replace String method with compiled Pattern is also not applicable Java 7 for such single character patterns – it would only decrease performance. Here is a times table for splitting “key=value” string with “=” pattern ten million times in Java 6 and 7.

  Java 6 Java 7
String.split 4.150 sec 1.183 sec
Pattern.split 2.601 sec 2.000 sec

If you have a simple case, like parsing lines of ini file, it is still worth to parse data manually. A simple call of indexOf(char) method is all you need to parse such string:

1
2
3
4
5
6
final int eq = str.indexOf( '=' );
if ( eq == -1 )
    return new String[] { str };
else
    return new String[] { str.substring( 0, eq ), str.substring( eq + 1 ) };
            
final int eq = str.indexOf( '=' );
if ( eq == -1 )
    return new String[] { str };
else
    return new String[] { str.substring( 0, eq ), str.substring( eq + 1 ) };
            

Support for comments starting with ‘#’ will require one more String.indexOf(char) call, but the same effort would be required for any other splitting approaches. Let’s add this approach results to the previous table.

  Java 6 Java 7
String.split 4.150 sec 1.183 sec
Pattern.split 2.601 sec 2.000 sec
String.indexOf 0.260 sec 0.230 sec

Next simple case. We need to split a space-separated string. The following function will replace String.split( "\\s+" ):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
private static String[] spaceSplit( final String s )
{
    final List<String> lst = new LinkedList<String>( );
    int p = 0;
    int start = 0;
    while ( p < s.length() )
    {
        if ( s.charAt( p ) <= ' ' )
        {
           if ( p == 0 )
              lst.add( "" );
           else if ( start < p )
              lst.add( s.substring( start, p ) );
           start = p + 1;
        }
        ++p;
    }
    if ( start < s.length() )
       lst.add( s.substring( start ) );
    return lst.toArray( new String[ lst.size() ] );
}
private static String[] spaceSplit( final String s )
{
    final List<String> lst = new LinkedList<String>( );
    int p = 0;
    int start = 0;
    while ( p < s.length() )
    {
        if ( s.charAt( p ) <= ' ' )
        {
           if ( p == 0 )
              lst.add( "" );
           else if ( start < p )
              lst.add( s.substring( start, p ) );
           start = p + 1;
        }
        ++p;
    }
    if ( start < s.length() )
       lst.add( s.substring( start ) );
    return lst.toArray( new String[ lst.size() ] );
}

For testing, we will use a string consisting of all numbers between 0 and 100,000 separated by a space character. Now, let’s call 1000 times String.split, Pattern.split and spaceSplit methods on this string.

  Java 6 Java 7
String.split 11.10 sec 10.96 sec
Pattern.split 10.72 sec 10.78 sec
spaceSplit 3.25 sec 3.20 sec

By the way, what will happen if we know that string parts are separated by exactly one space character (just one space as a split pattern)? We are still using above mentioned test string and call splitting methods 1000 times.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
private static String[] singleSpaceSplit(final String s)
{
    final LinkedList<String> lst = new LinkedList<String>( );
    int prev = 0;
    int p = 0;
    while ( ( p = s.indexOf( ' ', prev ) ) != -1 )
    {
        lst.add( s.substring( prev, p ) );
        prev = p + 1;
    }
    if ( lst.isEmpty() )
        return new String[]{ s };
    if ( prev > s.length() )
        lst.add( s.substring( prev ) );
    while ( !lst.isEmpty() && lst.getLast().isEmpty() )
        lst.removeLast();
    return lst.toArray( new String[ lst.size() ] );
}
        
private static String[] singleSpaceSplit(final String s)
{
    final LinkedList<String> lst = new LinkedList<String>( );
    int prev = 0;
    int p = 0;
    while ( ( p = s.indexOf( ' ', prev ) ) != -1 )
    {
        lst.add( s.substring( prev, p ) );
        prev = p + 1;
    }
    if ( lst.isEmpty() )
        return new String[]{ s };
    if ( prev > s.length() )
        lst.add( s.substring( prev ) );
    while ( !lst.isEmpty() && lst.getLast().isEmpty() )
        lst.removeLast();
    return lst.toArray( new String[ lst.size() ] );
}
        
  Java 6 Java 7
String.split(” “) 5.67 sec 2.37 sec
Pattern.split 5.34 sec 5.51 sec
singleSpaceSplit 2.94 sec 2.51 sec

As you may see, splitting by a single space character is working 4 times faster than splitting by multiple space characters in Java 7 ( 2.37 sec against 10.96 sec ). The last table also shows how fruitless are attempts to write single-character splitting methods manually when you don’t know any other properties of your input strings except separator character – hand-written method is working slightly slowly than JDK 7 optimization for this case.

Summary

Try to follow these rules while using regexp-related methods:

  • Always (or nearly always) replace String.matches, split, replaceAll, replaceFirst methods with Matcher and Pattern methods.
  • In Java 7 splitting by a single not regex-special character string is optimized in String.split method. Always use String.split to split such strings in Java 7 unless you know that input strings have limited (or better – fixed) number of separator characters.
  • In all other simple cases consider handwriting parsing methods for simple situations in the time-critical code. You can easily gain 10 times speedup by replacing Pattern methods with handcrafted methods.

2 thoughts on “Regexp-related methods of String

  1. Pingback: Use case: FIX message processing. Part 1: Writing a simple FIX parser  - Java Performance Tuning Guide

  2. Pingback: Oracle Java Mission Control Overview  - Java Performance Tuning Guide

Leave a Reply

Your email address will not be published. Required fields are marked *