A checksum is a function that maps an input byte array of arbitrary size to a small number of bytes (a 32-bit integer in the case of CRC32 and Adler32). Its most important property is that a small change in the input data (even a single byte incremented by one) produces a large change in the checksum. Checksums are usually calculated in one of the following two scenarios:
- A file was received along with a separately transmitted checksum. We calculate a checksum of the file contents and compare it with the received checksum to ensure that the file was not changed during transmission.
- A file is being read from an input device, and a checksum is calculated on its contents while the partial contents are processed. Occasionally the connection to the input device breaks and a retransmission is required. We can use the last calculated checksum and its file position to verify that the retransmitted data matches what was already received up to that position, so that we can skip the part that was already processed and continue from there.
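Both scenarios can be sketched in a few lines. This is only an illustration under assumptions: the class and method names (ChecksumScenarios, checksumOfPrefix, isIntact, canResume) are hypothetical, the data is held in memory as a byte array, and CRC32 is chosen arbitrarily.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical helper class, not part of the JDK.
class ChecksumScenarios {
    /** Computes a CRC32 checksum over the first {@code length} bytes of {@code data}. */
    static long checksumOfPrefix(byte[] data, int length) {
        Checksum checksum = new CRC32();
        checksum.update(data, 0, length);
        return checksum.getValue();
    }

    /** Scenario 1: compare the checksum of the received bytes with the separately received value. */
    static boolean isIntact(byte[] received, long expectedChecksum) {
        return checksumOfPrefix(received, received.length) == expectedChecksum;
    }

    /**
     * Scenario 2: check that the retransmitted data matches what was already
     * processed, up to the last known position.
     */
    static boolean canResume(byte[] retransmitted, int lastPosition, long lastChecksum) {
        return retransmitted.length >= lastPosition
                && checksumOfPrefix(retransmitted, lastPosition) == lastChecksum;
    }

    public static void main(String[] args) {
        byte[] original = "hello, checksum".getBytes(StandardCharsets.UTF_8);
        long sent = checksumOfPrefix(original, original.length);

        byte[] corrupted = original.clone();
        corrupted[0]++; // simulate a one-byte transmission error

        System.out.println("original intact:  " + isIntact(original, sent));
        System.out.println("corrupted intact: " + isIntact(corrupted, sent));

        // Remember the checksum after 5 processed bytes, then validate a retransmission.
        long partial = checksumOfPrefix(original, 5);
        System.out.println("can resume at 5:  " + canResume(original, 5, partial));
    }
}
```

In a real resume scenario the retransmitted bytes would normally be fed to the checksum as they arrive instead of being buffered in full; the array-based version above only keeps the sketch short.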
The best-known checksum implementation in the standard JDK is java.util.zip.CRC32. Many developers think that it is the only standard checksum implementation (many of them would say ‘standard CRC implementation’). Actually, there is one more checksum implementation, called Adler32. Both classes implement the same interface, java.util.zip.Checksum.
The difference between these two classes is that Adler32 is a bit weaker than CRC32. Nevertheless, if your inputs are longer than a couple of thousand bytes, you can ignore the difference in checksum quality. More importantly, Adler32 is calculated faster than CRC32. Here is a table comparing the calculation of these two checksums on a 500 megabyte archive (read into memory in advance).
| | Adler32, 500M in one call | CRC32, 500M in one call | Adler32, 500M in 10 byte blocks | CRC32, 500M in 10 byte blocks | Adler32, 500M in 1000 byte blocks | CRC32, 500M in 1000 byte blocks |
|---|---|---|---|---|---|---|
| Java 6 | 0.233 sec | 1.526 sec | 3.585 sec | 4.119 sec | 0.253 sec | 1.545 sec |
| Java 7 | 0.227 sec | 0.666 sec | 3.684 sec | 3.691 sec | 0.267 sec | 0.699 sec |
As you can see, on long enough data blocks Adler32 is 6-7 times faster than CRC32 in Java 6. In Java 7 CRC32 was clearly optimized, and it is only 2-3 times slower than Adler32. On smaller data blocks the cost of JNI calls (both checksums are implemented in native code) becomes the dominating factor.
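A comparison of this kind can be reproduced with a rough sketch like the one below. It is not the code behind the table: the class name ChecksumBlockBenchmark and its helper methods are hypothetical, the test array is an assumed 64 MB of random bytes instead of a 500 MB archive, and there is no JIT warm-up, so the absolute timings will differ between JVM versions and machines.

```java
import java.util.Random;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical micro-benchmark sketch, not the original test harness.
class ChecksumBlockBenchmark {
    /** Feeds data to the checksum in blocks of blockSize bytes and returns the final value. */
    static long checksumInBlocks(Checksum checksum, byte[] data, int blockSize) {
        checksum.reset();
        for (int offset = 0; offset < data.length; offset += blockSize) {
            int length = Math.min(blockSize, data.length - offset);
            checksum.update(data, offset, length);
        }
        return checksum.getValue();
    }

    /** Times one pass over the data and prints the elapsed wall-clock time. */
    static void time(String name, Checksum checksum, byte[] data, int blockSize) {
        long start = System.nanoTime();
        long value = checksumInBlocks(checksum, data, blockSize);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%s, %d-byte blocks: %d ms (checksum %08x)%n",
                name, blockSize, millis, value);
    }

    public static void main(String[] args) {
        // Assumed test size: 64 MB of pseudo-random bytes, kept in memory.
        byte[] data = new byte[64 * 1024 * 1024];
        new Random(42).nextBytes(data);

        // One call over the whole array, then 1000-byte and 10-byte blocks.
        for (int blockSize : new int[] {data.length, 1000, 10}) {
            time("Adler32", new Adler32(), data, blockSize);
            time("CRC32", new CRC32(), data, blockSize);
        }
    }
}
```

The block size only changes how many update calls are made; the resulting checksum value is the same for every block size.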
One may say that the difference between 0.3 sec and 1.5 sec on 500 megabytes is not worth paying attention to. That is not quite true: modern hard drives can read data at about 100 MB/sec, so reading half a gigabyte takes about 5 seconds. Roughly one extra second (1.5 - 0.3 sec) is a 20% performance decrease in this case. To make it clearer, you may read it as ‘our system is capable of reading 500,000 messages per second with CRC32 and 600,000 messages per second with Adler32’ (put your own numbers here).
If you can choose which checksum implementation to use, try Adler32 first. If its quality is sufficient for you, use it instead of CRC32. In any case, access the checksum through the Checksum interface.
Try to update the checksum in blocks of at least 500 bytes. Shorter blocks cause a noticeable share of the time to be spent in JNI calls.
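Both recommendations can be combined in a minimal sketch like the following. The class name StreamChecksum, the method checksumOf and the 8192-byte buffer size are assumptions made for illustration; only the Checksum, Adler32 and CRC32 types come from the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Hypothetical helper class, not part of the JDK.
class StreamChecksum {
    // Assumed buffer size, well above the 500-byte threshold suggested above.
    private static final int BUFFER_SIZE = 8192;

    /**
     * Reads the stream to its end, updating the given checksum one buffer at a time.
     * The caller decides which implementation is used; this method only sees the interface.
     */
    static long checksumOf(InputStream in, Checksum checksum) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        int read;
        while ((read = in.read(buffer)) != -1) {
            checksum.update(buffer, 0, read); // a stream may return fewer bytes than requested
        }
        return checksum.getValue();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100_000]; // placeholder input; a file stream would work the same way

        // Switching implementations is a one-line change.
        long adler = checksumOf(new ByteArrayInputStream(data), new Adler32());
        long crc = checksumOf(new ByteArrayInputStream(data), new CRC32());

        System.out.printf("Adler32: %08x%n", adler);
        System.out.printf("CRC32:   %08x%n", crc);
    }
}
```

Because checksumOf depends only on the Checksum interface, replacing Adler32 with CRC32 (or the other way round) does not touch the reading loop.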