DrivePool overhead for millions of small files?

AyaNeko · July 6, 2024

Hi, I have been using DrivePool for a long time now and am very happy with it.
I now have some new requirements that I have never had to deal with before, so I have two questions:

First question:
I will have a couple of million small files, some under 32 KB, some around 250 KB, using around 30 terabytes in total. Yes, it's crazy, but please don't ask why. What I need to know is: how much will DrivePool's overhead affect random read performance? The data will be a bunch of blobs and hashes sorted into subfolders named 00 to zz (the first 2 characters of each blob's hash, with the hash used as the filename), with millions of hash blobs inside those folders. It will be randomly accessed and traversed, and I need it to be as fast as possible: at least as fast as bare metal (directly on the HDD), or faster if possible. Since DrivePool spreads those files out over multiple drives, I am hoping for higher IOPS and throughput, but that all depends on how much overhead DrivePool adds in such a scenario.
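
For readers following along, here is a minimal sketch of the kind of folder layout described above. It is an illustration only: the choice of SHA-256 and hex digests (which gives folders 00 to ff rather than 00 to zz) is an assumption, and the function names `blob_path` and `store_blob` are hypothetical.

```python
import hashlib
from pathlib import Path


def blob_path(root: Path, data: bytes) -> Path:
    """Return root/<first two digest chars>/<full digest> for a blob.

    Assumption: SHA-256 with a hex digest; the original post does not
    say which hash or encoding is used.
    """
    digest = hashlib.sha256(data).hexdigest()
    return root / digest[:2] / digest


def store_blob(root: Path, data: bytes) -> Path:
    """Write the blob under its hash-derived path unless it already exists."""
    path = blob_path(root, data)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():
        path.write_bytes(data)
    return path
```

With a two-character prefix there are only a few hundred folders, so each folder ends up holding a large number of files once the total reaches the millions.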

Second question:
If I make a VM with a couple of, let's say, 500 GB VHDX hard disk files stored on DrivePool, those files are locked while the VM is running. Can DrivePool still balance them somehow while the VM is running?

Thank you in advance!

Shane · July 7, 2024

A1: At least to an extent. In my experience, the more you can ensure that the files are read concurrently from different drives, the more DrivePool can beat a single drive (so long as your bus has enough bandwidth). Conversely, if you can't ensure that, it can do a little to significantly worse than a single drive, depending on access patterns.

Using 2x (or higher, YMMV) duplication and enabling read-striping on the pool can help greatly with this (e.g. if it goes to read two files and they're on the same disk, then if you'd turned on 2x duplication and striping it could've read each file from a different disk).
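
One way to check this on your own hardware is to time concurrent random reads against the pool and against a single bare drive holding the same files. The following is a rough sketch, not a rigorous benchmark: the function names, sample size and worker count are arbitrary assumptions, and OS file caching will inflate results on repeated runs over the same sample.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_file(path: Path) -> int:
    """Read a whole file and return its size in bytes."""
    with open(path, "rb") as f:
        return len(f.read())


def benchmark_random_reads(root: Path, sample_size: int = 1000,
                           workers: int = 8, seed: int = 0) -> dict:
    """Read a random sample of files under root using a thread pool.

    Note: walking millions of files with rglob is slow in itself;
    for a real test, build the file list once and reuse it.
    """
    files = [p for p in root.rglob("*") if p.is_file()]
    if not files:
        raise ValueError(f"no files found under {root}")
    rng = random.Random(seed)
    sample = rng.sample(files, min(sample_size, len(files)))

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total_bytes = sum(pool.map(read_file, sample))
    elapsed = time.perf_counter() - start

    return {
        "files": len(sample),
        "bytes": total_bytes,
        "seconds": elapsed,
        "files_per_second": len(sample) / elapsed if elapsed > 0 else float("inf"),
    }
```

Run it once with `root` pointing at the pool drive and once at a bare disk with the same data, ideally after a reboot or with different seeds so the cache doesn't skew the comparison.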

Some older summing/hashing applications may encounter problems with read-striping (I suspect because they try physical calls and DrivePool isn't physical). You'll need to test before going "live".
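
A simple test along those lines is to hash a set of files from a known-good copy and again through the pool with read-striping enabled, then compare. This sketch uses plain Python file reads, so it only shows whether ordinary reads through the pool return identical data; it does not reproduce whatever an older hashing tool does internally. The names `sha256_file` and `find_mismatches` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_mismatches(reference_root: Path, pool_root: Path) -> list:
    """Return relative paths that are missing from the pool or hash differently."""
    mismatches = []
    for ref in sorted(reference_root.rglob("*")):
        if not ref.is_file():
            continue
        rel = ref.relative_to(reference_root)
        candidate = pool_root / rel
        if not candidate.is_file() or sha256_file(ref) != sha256_file(candidate):
            mismatches.append(rel.as_posix())
    return mismatches
```

An empty result means every reference file was found in the pool with the same digest. You would still want to run your actual application against the pool before relying on it.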

Incidentally, if you're going to have millions of files to which you need fast access, then make sure you've got enough RAM to keep the file table fully loaded (e.g. my disks currently contain ~3.2M files after duplication, using up ~3.5GB for the Metafile in RAM).
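
As a back-of-envelope illustration of those figures: ~3.5GB for ~3.2M files works out to roughly 1.1KB of RAM per file. The sketch below extrapolates from that single data point. Linear scaling is an assumption, not a measured property, and the function name and the 5M-file example are hypothetical.

```python
def estimate_metadata_ram_gb(file_count: int, duplication: int = 1,
                             ref_files: int = 3_200_000,
                             ref_ram_gb: float = 3.5) -> float:
    """Scale the quoted data point (~3.5GB for ~3.2M files) linearly.

    Assumption: RAM use grows in proportion to the number of files on
    disk, counting each duplicate as a separate file.
    """
    files_on_disk = file_count * duplication
    return files_on_disk * ref_ram_gb / ref_files


# Hypothetical example: 5 million files stored with 2x duplication.
print(f"{estimate_metadata_ram_gb(5_000_000, duplication=2):.1f} GB")
```

Treat the result as a rough lower bound for planning and check actual memory use on a test set before committing to hardware.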

A2: DrivePool does not balance files when they are locked. Also do not use DrivePool duplication with VM images unless real-time duplication is both enabled and completed prior to running, to avoid consistency errors.
