java.io.BufferedInputStream and java.util.zip.GZIPInputStream

by Mikhail Vorontsov

BufferedInputStream and GZIPInputStream are two classes often used while reading data from a file (the latter is widely used at least on Linux). Buffering input data is generally a good idea, as many Java performance articles have described, but there are still a couple of issues worth knowing about these streams.

When not to buffer

Buffering is done in order to reduce the number of separate read operations from an input device. Many developers forget about this and always wrap an InputStream inside a BufferedInputStream, like this:

final InputStream is = new BufferedInputStream( new FileInputStream( file ) );

The short rule on whether to buffer is the following: you don't need buffering if your data blocks are long enough (100K+) and you can process blocks of any length (that is, you don't need a guarantee that at least N bytes are available in the buffer before processing starts). In all other cases you should buffer input data.

The simplest example of when you don't need buffering is a manual file copy:

public static void copyFile( final File from, final File to ) throws IOException {
    final InputStream is = new FileInputStream( from );
    try
    {
        final OutputStream os = new FileOutputStream( to );
        try
        {
            final byte[] buf = new byte[ 8192 ];
            int read = 0;
            while ( ( read = is.read( buf ) ) != -1 )
            {
                os.write( buf, 0, read );
            }
        }
        finally {
            os.close();
        }
    }
    finally {
        is.close();
    }
}

Note 1: it is difficult to measure file copy performance, because it is heavily affected by operating system write caching. On my machine the time to copy a 4.5G file to the same disk varied between 68 and 107 sec.

Note 2: file copying is usually done via Java NIO, using the FileChannel.transferTo or transferFrom methods. These methods do not require constant switching between kernel and user modes (to transfer read data into a byte buffer in the user Java program and then copy it back to the output file via kernel calls). Instead they transfer as much data as possible (up to 2^31 − 1 bytes) in kernel mode, not returning to user code for as long as possible. As a result, Java NIO code uses fewer CPU cycles, leaving them to other programs. The difference, though, is noticeable only in high-load environments (on my machine it is 4% total CPU load in NIO mode vs 8-9% in old stream mode). Here is a possible Java NIO implementation:

private static void copyFileNio( final File from, final File to ) throws IOException {
    final RandomAccessFile inFile = new RandomAccessFile( from, "r" );
    try
    {
        final RandomAccessFile outFile = new RandomAccessFile( to, "rw" );
        try
        {
            final FileChannel inChannel = inFile.getChannel();
            final FileChannel outChannel = outFile.getChannel();
            long pos = 0;
            long toCopy = inFile.length();
            while ( toCopy > 0 )
            {
                final long bytes = inChannel.transferTo( pos, toCopy, outChannel );
                pos += bytes;
                toCopy -= bytes;
            }
        }
        finally {
            outFile.close();
        }
    }
    finally {
        inFile.close();
    }
}

Buffer size

The default buffer size in a BufferedInputStream is 8192 bytes. The buffer size is, in effect, the average size of a block read from the input device. That's why it is often worth explicitly increasing it to at least 64K (65536), and in the case of very large input files to something between 512K and 2M in order to further reduce the number of disk reads. Many experts also advise setting it to a multiple of 4096 – one of the common disk sector sizes – so don't set the buffer size to, for example, 125000. Set it to 131072 (128K) instead.
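As a sketch of this recommendation, the buffer size is simply the second constructor argument. The class name, the temporary file, and the 300000-byte payload below are made up purely for illustration:

```java
import java.io.*;

public class BufferSizeDemo {
    // Reads the whole file through a BufferedInputStream with the given
    // buffer size and returns the number of bytes read.
    public static long readAll( final File file, final int bufferSize ) throws IOException {
        final InputStream is = new BufferedInputStream( new FileInputStream( file ), bufferSize );
        try {
            long total = 0;
            final byte[] chunk = new byte[ 8192 ];
            int read;
            while ( ( read = is.read( chunk ) ) != -1 )
                total += read;
            return total;
        }
        finally {
            is.close();
        }
    }

    public static void main( final String[] args ) throws IOException {
        // Hypothetical temporary file, created only for this demonstration
        final File file = File.createTempFile( "buf-demo", ".bin" );
        file.deleteOnExit();
        final OutputStream os = new FileOutputStream( file );
        try {
            os.write( new byte[ 300000 ] );
        }
        finally {
            os.close();
        }
        // 131072 = 128K, a multiple of the common 4096-byte sector size
        System.out.println( "Read " + readAll( file, 131072 ) + " bytes" );
    }
}
```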

java.util.zip.GZIPInputStream is an input stream capable of reading gzip files. It is usually used like this:

final InputStream is = new GZIPInputStream( new BufferedInputStream( new FileInputStream( file ) ) );

Such initialization is good enough, but the BufferedInputStream is redundant here, because GZIPInputStream has its own built-in buffer. There is a byte buffer inside it (actually, a member of its parent InflaterInputStream) which is used to read compressed data from the underlying stream and pass it to the inflater. The default size of this buffer is 512 bytes, so it should be set to a higher value in any real-world program. A better way to use GZIPInputStream is something like:

final InputStream is = new GZIPInputStream( new FileInputStream( file ), 65536 );

BufferedInputStream.available

BufferedInputStream.available has a possible performance problem, depending on what you really expect to receive from it. Its standard implementation returns the sum of the number of bytes available in BufferedInputStream's own internal buffer and the result of the underlying input stream's available() call. So it tries to return as exact a value as possible. But in many cases all the user wants to know is whether there is something left in the buffer ( available() > 0 ). In that case, if a BufferedInputStream has even one byte left in its own buffer, we don't really need to query the underlying stream. This is especially important when a FileInputStream is wrapped in a BufferedInputStream – such an optimization saves us a disk access in FileInputStream.available().

Fortunately, we can fix this problem rather easily. BufferedInputStream is not a final class, so we can extend it and override the available method. The JDK sources make a good starting point. From them we can also learn that the Java 6 implementation had a bug – if the sum of the bytes available in the BufferedInputStream and the underlying stream's available() result exceeded Integer.MAX_VALUE, a negative result was returned due to overflow. This was fixed in Java 7.
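The overflow is plain int arithmetic wrapping around. A minimal sketch (the variable names are made up, and the clamping expression is just one way to express the Java 7 behavior, not a copy of the JDK code):

```java
public class AvailableOverflowDemo {
    public static void main( final String[] args ) {
        final int buffered = 100;                     // bytes left in the BufferedInputStream buffer
        final int underlying = Integer.MAX_VALUE;     // underlying stream claims a huge byte count

        // Java 6 style: a plain int sum, which wraps around to a negative value
        final int naiveSum = buffered + underlying;

        // Java 7 style: clamp the result at Integer.MAX_VALUE instead of overflowing
        final int clamped = buffered + Math.min( underlying, Integer.MAX_VALUE - buffered );

        System.out.println( naiveSum ); // negative due to int overflow
        System.out.println( clamped );  // 2147483647 (Integer.MAX_VALUE)
    }
}
```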

Here is our improved implementation, which returns either the number of bytes available in BufferedInputStream's own buffer or, if nothing is left in it, the result of the underlying stream's available() call. In most cases this implementation will call the underlying stream's available() extremely seldom, because whenever the BufferedInputStream buffer has been read to the end, the class refills it from the underlying stream. So we will call the underlying stream's available() only at the end of the input file.

public class FasterBufferedInputStream extends BufferedInputStream
{
    public FasterBufferedInputStream(InputStream in, int size) {
        super(in, size);
    }
 
    // This method returns a positive value if something is available; otherwise it returns zero.
    public int available() throws IOException {
        if (in == null)
            throw new IOException( "Stream closed" );
        final int n = count - pos;
        return n > 0 ? n : in.available();
    }
}

To test this implementation, I read a 4.5G file using the standard and improved BufferedInputStream with a 64K buffer, calling available() once per 512 or 1024 bytes. Clean testing would require an operating system restart after each run in order to flush the disk cache. Instead, I read the file in a warm-up stage and tested the performance of the two approaches with the file already in the disk cache. The test revealed that the standard class's run time depended linearly on the number of available() calls, while the improved one looked like it was ignoring such calls:

         | standard, 512 bytes | improved, 512 bytes | standard, 1024 bytes | improved, 1024 bytes
Java 6   | 17.062 sec          | 2.11 sec            | 9.592 sec            | 2.047 sec
Java 7   | 17.337 sec          | 2.125 sec           | 9.748 sec            | 2.044 sec

Here is the test source code:

private static void testRead( final InputStream is ) throws IOException {
    final long start = System.currentTimeMillis();
    final byte[] buf = new byte[ 512 ];   //or 1024 bytes
    while ( true )
    {
        if ( is.available() == 0 ) break;
        is.read( buf );
    }
    final long time = System.currentTimeMillis() - start;
    System.out.println( "Impl: " + is.getClass().getCanonicalName() + " time = " + time / 1000.0 + " sec");
}
 
Call it using:
final InputStream is1 = new BufferedInputStream( new FileInputStream( file ), 65536 );
and
final InputStream is2 = new FasterBufferedInputStream( new FileInputStream( file ), 65536 );

Summary

  • Both BufferedInputStream and GZIPInputStream have internal buffers. The default size for the former is 8192 bytes and for the latter 512 bytes. It is generally worth increasing either of these sizes to at least 65536.
  • Do not use a BufferedInputStream as the input for a GZIPInputStream; instead, explicitly set the GZIPInputStream buffer size in the constructor. Keeping a BufferedInputStream is still safe, though.
  • If you have a new BufferedInputStream( new FileInputStream( file ) ) object and you call its available method rather often (for example, once or twice per input message), consider overriding the BufferedInputStream.available method. It will greatly speed up file reading.
