java.util.zip.CRC32 and java.util.zip.Adler32 performance

by Mikhail Vorontsov

A checksum is a function that maps a byte array of arbitrary size to a small, fixed number of bytes (a 32-bit integer in the case of CRC32 and Adler32). Its most important property is that a small change in the input data (even a single byte increased by one) produces a large change in the checksum. Checksums are usually calculated in one of the following two scenarios:

  • A file was received together with a separately transmitted checksum. We calculate a checksum over the file contents and compare it with the received one to ensure that the file was not changed during transmission.
  • A file is being read from an input device, and a checksum is calculated over its contents while the data is processed piece by piece. Occasionally the connection to the input device breaks and retransmission is required. We can use the last calculated checksum and the file position it corresponds to in order to verify that the retransmitted data is identical up to that position. The already processed part then does not need to be reprocessed, and processing continues from that position.
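Both scenarios can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name ChecksumScenarios and the method names matches and prefixMatches are hypothetical, and the data is assumed to be already in memory as a byte array.

```java
import java.util.zip.Checksum;

class ChecksumScenarios {

    /** Scenario 1: does the received data match the checksum that arrived separately? */
    public static boolean matches(byte[] data, long expected, Checksum checksum) {
        checksum.reset();
        checksum.update(data, 0, data.length);
        return checksum.getValue() == expected;
    }

    /**
     * Scenario 2: do the first 'position' bytes of the retransmitted data
     * produce the checksum that was saved before the connection broke?
     */
    public static boolean prefixMatches(byte[] retransmitted, int position,
                                        long savedChecksum, Checksum checksum) {
        if (position < 0 || position > retransmitted.length) {
            return false; // not enough data to cover the saved position
        }
        checksum.reset();
        checksum.update(retransmitted, 0, position);
        return checksum.getValue() == savedChecksum;
    }
}
```

If prefixMatches returns true, processing can resume at 'position' instead of starting from the beginning of the file.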

The best-known checksum implementation in the standard JDK is java.util.zip.CRC32. Many developers think that it is the only standard checksum implementation (many of them would call it ‘the standard CRC implementation’). In fact, there is one more checksum implementation: java.util.zip.Adler32. Both CRC32 and Adler32 implement the same interface, java.util.zip.Checksum.
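Because both classes implement Checksum, the calling code does not need to know which one it was given. A minimal sketch (the class name ChecksumInterfaceDemo and the method checksumOf are hypothetical names chosen for this example):

```java
import java.util.zip.Checksum;

class ChecksumInterfaceDemo {

    /** Works with CRC32, Adler32 or any other Checksum implementation. */
    public static long checksumOf(byte[] data, Checksum checksum) {
        checksum.reset();
        checksum.update(data, 0, data.length);
        // getValue() returns the 32-bit checksum as a non-negative long
        return checksum.getValue();
    }
}
```

For the ASCII bytes of "123456789", checksumOf(data, new CRC32()) returns 0xCBF43926 and checksumOf(data, new Adler32()) returns 0x091E01DE. Switching algorithms means changing only the object passed in.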

The difference between these two classes is that Adler32 is a somewhat weaker checksum than CRC32. Nevertheless, if your inputs are longer than a couple of thousand bytes, you can ignore the difference in checksum quality. What is more important is that Adler32 is calculated faster than CRC32. Here is a table comparing the calculation of these two checksums on a 500 megabyte archive (read into memory in advance).

                                     Java 6       Java 7
Adler32, 500M in one call            0.233 sec    0.227 sec
CRC32, 500M in one call              1.526 sec    0.666 sec
Adler32, 500M in 10 byte blocks      3.585 sec    3.684 sec
CRC32, 500M in 10 byte blocks        4.119 sec    3.691 sec
Adler32, 500M in 1000 byte blocks    0.253 sec    0.267 sec
CRC32, 500M in 1000 byte blocks      1.545 sec    0.699 sec

As you can see, on long enough data blocks Adler32 is about 6 times faster than CRC32 in Java 6 (0.233 sec against 1.526 sec in one call). CRC32 was clearly optimized in Java 7, where it is only 2-3 times slower than Adler32. On small data blocks the cost of JNI calls (both checksums are implemented in native code) becomes the dominating factor.
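A comparison of this kind can be sketched as follows. This is not the code that produced the numbers above. It is a crude illustration with hypothetical names (BlockSizeBenchmark, checksumInBlocks, timeNanos), without JVM warm-up or repeated runs, so its timings should be treated as rough indications only.

```java
import java.util.zip.Checksum;

class BlockSizeBenchmark {

    /** Feeds the data to the checksum with one update() call per block of blockSize bytes. */
    public static long checksumInBlocks(Checksum checksum, byte[] data, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        checksum.reset();
        for (int pos = 0; pos < data.length; pos += blockSize) {
            int len = Math.min(blockSize, data.length - pos); // last block may be shorter
            checksum.update(data, pos, len);
        }
        return checksum.getValue();
    }

    /** Elapsed wall-clock nanoseconds for a single pass over the data. */
    public static long timeNanos(Checksum checksum, byte[] data, int blockSize) {
        long start = System.nanoTime();
        checksumInBlocks(checksum, data, blockSize);
        return System.nanoTime() - start;
    }
}
```

Calling timeNanos(new Adler32(), data, 10) and timeNanos(new Adler32(), data, 1000) on the same array shows the effect of the block size. The checksum value itself does not depend on how the data is split into blocks.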

One may say that the difference between 0.3 sec and 1.5 sec on 500 megabytes is not worth paying attention to. This is not quite true: modern hard drives can read data at about 100 megabytes per second, so reading half a gigabyte takes about 5 seconds. The extra second or so (1.5 - 0.3 = 1.2 sec) is a performance decrease of roughly 20% in this case. To make it clearer, you may read it as ‘our system is capable of reading 500,000 messages per second with CRC32 and 600,000 messages per second with Adler32’ (put your own numbers here).

Summary

If you can choose which checksum implementation to use, try Adler32 first. If its quality is sufficient for you, use it instead of CRC32. In any case, access the Adler32/CRC32 logic through the Checksum interface.

Try to update the checksum in blocks of at least 500 bytes. With shorter blocks, a noticeable share of the time is spent in JNI calls.
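When the data comes from a stream rather than from an array, the simplest way to follow this advice is to read into a reasonably large buffer and pass each filled part of it to update() in a single call. A minimal sketch, assuming a hypothetical helper class StreamChecksum (the buffer size is left to the caller, for example 8192 bytes):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.Checksum;

class StreamChecksum {

    /** Reads the stream to its end and returns the checksum of everything read. */
    public static long checksumOf(InputStream in, Checksum checksum, int bufferSize)
            throws IOException {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        byte[] buffer = new byte[bufferSize];
        checksum.reset();
        int read;
        // read() may return fewer bytes than the buffer holds, so only the
        // first 'read' bytes are passed on; one update() call per read
        while ((read = in.read(buffer)) != -1) {
            checksum.update(buffer, 0, read);
        }
        return checksum.getValue();
    }
}
```

The method does not close the stream; that is left to the caller.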

