Single file vs multi file storage

by Mikhail Vorontsov

Introduction

This article discusses how badly an application can be affected by data storage made up of a huge number of tiny files. While it is obvious that keeping the same amount of data in a single file is better than spreading it across a million tiny files, we will see what modern operating systems offer to alleviate this problem.

This article describes a data layout problem I recently faced. Suppose we have a database containing a few million records with an average size of 10 KB and no hard upper limit. Each record has a unique long id. There is some other data besides the records, but its size is unlikely to exceed ~5% of the space occupied by the records.

For some reason, it was decided to use ordinary files as the data storage. The original implementation used one file per record. Files were distributed across directories according to auxiliary attributes. All this data was read into memory on startup. There are also analytical batch tools that read the whole database into memory before processing it.
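
To make the layout concrete, loading such a store on startup might look roughly like the sketch below. This is not the original code: the class name `PerRecordFileLoader`, the method `loadAll`, and the convention that each file name is the decimal record id are assumptions made purely for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical loader for a "one file per record" store.
final class PerRecordFileLoader {

    /**
     * Reads every regular file under root (including subdirectories)
     * into memory, keyed by record id.
     * Assumption: each file name is the decimal record id, e.g. "12345".
     */
    static Map<Long, byte[]> loadAll(Path root) throws IOException {
        List<Path> files;
        // Files.walk holds directory handles open, so close the stream promptly.
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        Map<Long, byte[]> records = new HashMap<>();
        for (Path file : files) {
            long id = Long.parseLong(file.getFileName().toString());
            // One open/read/close cycle per record: with millions of tiny
            // files, this per-file overhead is repeated millions of times.
            records.put(id, Files.readAllBytes(file));
        }
        return records;
    }
}
```

Calling `PerRecordFileLoader.loadAll(Paths.get("/path/to/store"))` (the path is a placeholder) returns the whole database as an in-memory map. Each record costs a separate file open, read, and close.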

The system took a very long time to start up because it had to read all these tiny files from disk. My task was to eliminate this bottleneck.
