Tag Archives: hdd

Single file vs multi file storage

by Mikhail Vorontsov


This article discusses how badly your application may be affected by a data store made up of a huge number of tiny files. While it is obvious that keeping the same amount of data in a single file is better than spreading it over a million tiny files, we will see what modern operating systems offer to alleviate this problem.

It describes a data layout problem I recently faced. Suppose we have a database containing a few million records with an average size of 10 KB, but no hard upper limit. Each of these records has a unique long id. There is some other data as well, but its size is unlikely to exceed ~5% of the space occupied by the records.

For some reason, a decision was made to use ordinary files as the data store. The original implementation used one file per entry. Files were split into directories according to auxiliary attributes. All this data was read into memory on startup. There are also other analytical batch tools that read the whole database into memory before processing it.

The system took a very long time to start up because it had to read all these tiny files from disk. My task was to remove this bottleneck.
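To make the single-file alternative concrete, here is a minimal sketch of packing all records into one file. The layout (an 8-byte id, a 4-byte length, then the payload) and the `PackedRecords` class with its `writeAll` and `readAll` methods are assumptions made for illustration, not the format the original system ended up using.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical layout: [long id][int length][payload bytes], repeated until end of file.
// The int length prefix caps a single record at about 2 GB, which is an assumption of this sketch.
final class PackedRecords {
    private static final int BUFFER_SIZE = 32 * 1024;

    // Writes every record sequentially into one file.
    static void writeAll(Path file, Map<Long, byte[]> records) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file), BUFFER_SIZE))) {
            for (Map.Entry<Long, byte[]> entry : records.entrySet()) {
                out.writeLong(entry.getKey());
                out.writeInt(entry.getValue().length);
                out.write(entry.getValue());
            }
        }
    }

    // Reads the whole file back into memory with one sequential pass.
    static Map<Long, byte[]> readAll(Path file) throws IOException {
        Map<Long, byte[]> records = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file), BUFFER_SIZE))) {
            while (true) {
                long id;
                try {
                    id = in.readLong();
                } catch (EOFException endOfFile) {
                    break; // no further record header: stop reading
                }
                byte[] payload = new byte[in.readInt()];
                in.readFully(payload);
                records.put(id, payload);
            }
        }
        return records;
    }
}
```

With this layout, startup opens one file and reads it sequentially instead of opening millions of files. The cost is that updating or deleting a single record is no longer a simple file operation.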


I/O bound algorithms: SSD vs HDD

by Mikhail Vorontsov

This article investigates the impact of modern SSDs on the I/O bound algorithms of the HDD era.

Improved write speed of SSD

Modern SSDs provide read/write speeds of up to 500 MB/sec. Compare that to the cap of approximately 100 MB/sec on a modern HDD. It means that your application has to produce its output 5 times faster than before in order to remain I/O bound.

Let’s run 3 tests:

  1. Fill an 8 GB file with a repeating sequence of 1024 bytes using a BufferedOutputStream with a 32 KB buffer. Data is written in a loop, and no extra processing is done inside the loop ( testWriteNoProcessing method from the following code snippet ).
  2. Fill an 8 GB file with a sequence of 1024 bytes which is recomputed on every iteration before being written. Data is written using a BufferedOutputStream with a 32 KB buffer ( testWriteSimple ).
  3. Same as the previous test, but the data is not written to disk. This test estimates how long it takes to prepare the data for writing.
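The three tests above could be sketched roughly as follows. This is not the article's original snippet: only the names `testWriteNoProcessing` and `testWriteSimple` come from the text, while `testNoWrite`, the `fill` helper, the `WriteTests` class and the `totalBytes` parameter are assumptions introduced here so that the sketch can be run with a smaller file.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class WriteTests {
    static final int BLOCK_SIZE = 1024;        // size of the repeated sequence
    static final int BUFFER_SIZE = 32 * 1024;  // BufferedOutputStream buffer
    static volatile long sink;                 // keeps the JIT from discarding test 3's work

    // Test 1: write the same precomputed block over and over. Returns elapsed nanoseconds.
    static long testWriteNoProcessing(Path file, long totalBytes) throws IOException {
        byte[] block = new byte[BLOCK_SIZE];
        fill(block, 0);
        long start = System.nanoTime();
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(file), BUFFER_SIZE)) {
            // Writes whole blocks, so totalBytes is rounded up to a multiple of BLOCK_SIZE.
            for (long written = 0; written < totalBytes; written += BLOCK_SIZE) {
                out.write(block);
            }
        }
        return System.nanoTime() - start;
    }

    // Test 2: recompute the block on every iteration, then write it.
    static long testWriteSimple(Path file, long totalBytes) throws IOException {
        byte[] block = new byte[BLOCK_SIZE];
        long iteration = 0;
        long start = System.nanoTime();
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(file), BUFFER_SIZE)) {
            for (long written = 0; written < totalBytes; written += BLOCK_SIZE) {
                fill(block, iteration++);
                out.write(block);
            }
        }
        return System.nanoTime() - start;
    }

    // Test 3 (hypothetical name): same computation as test 2, but nothing is written.
    static long testNoWrite(long totalBytes) {
        byte[] block = new byte[BLOCK_SIZE];
        long iteration = 0;
        long checksum = 0;
        long start = System.nanoTime();
        for (long produced = 0; produced < totalBytes; produced += BLOCK_SIZE) {
            fill(block, iteration++);
            checksum += block[0];
        }
        long elapsed = System.nanoTime() - start;
        sink = checksum;
        return elapsed;
    }

    // Hypothetical stand-in for "recomputing" the data: a cheap pass over the block.
    static void fill(byte[] block, long seed) {
        for (int i = 0; i < block.length; i++) {
            block[i] = (byte) (seed + i);
        }
    }
}
```

Passing `8L * 1024 * 1024 * 1024` as `totalBytes` matches the 8 GB file size used in the tests. Timings from a sketch like this are only indicative, since the operating system's write cache and JIT warm-up both affect the measured numbers.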
