Performance of seeking inside files under BTRFS with LZO compression

I am planning to use btrfs on a 50 TB RAID6 array and I want to enable lzo compression.

This is for a bioinformatics setup where lots of seeking within large (1 TB -- 20 TB) files occurs. (The software reads only small chunks of data scattered across the file.)

What worries me is that I don't understand how seeking is performed on compressed filesystems like btrfs. Does the file need to be decompressed from the beginning up to the sought-after position first? That would have a huge negative impact on my setup.

Or a more general question: does the seek time scale with file size the same way as on a non-compressed filesystem, or does it get worse, e.g. O(file_length)?


Solution 1:

Random seek times will be roughly O(1), as on uncompressed filesystems, but with the caveat that up to 128 KiB of data is compressed together, so to read even a single byte, all the data in that 128 KiB block has to be read and decompressed. Depending on the access pattern this can have a fairly large performance impact, so you need to benchmark it with your specific application and dataset.
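To get a feel for that cost on your own data, a minimal benchmarking sketch along these lines may help (the path, read size, and read count are placeholder assumptions; it simply times small random reads at scattered offsets, which is roughly the access pattern described in the question):

    import os, random, time

    PATH = "/mnt/array/sample.dat"   # placeholder: a large file on the compressed filesystem
    READ_SIZE = 4096                 # small read, well below the 128 KiB compression granularity
    N_READS = 1000

    fd = os.open(PATH, os.O_RDONLY)
    size = os.fstat(fd).st_size
    offsets = [random.randrange(0, size - READ_SIZE) for _ in range(N_READS)]

    start = time.perf_counter()
    for off in offsets:
        os.pread(fd, READ_SIZE, off)   # each read forces the containing compressed block to be decompressed
    elapsed = time.perf_counter() - start
    os.close(fd)

    # Page cache effects dominate on repeated runs; drop caches between runs for cold numbers.
    print(f"{N_READS} random {READ_SIZE}-byte reads in {elapsed:.3f} s "
          f"({N_READS / elapsed:.0f} reads/s)")

Running it once more against the same file stored without compression shows how much the 128 KiB decompression granularity actually costs for your access pattern.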

(Source)

Solution 2:

There's a lot of misinformation about FS compression on the internet, and here on Stackoverflow. Filesystem compression is done at the block level (or chunk level, depending on the device), not at the file abstraction level, so ostensibly seeking is the same -- file seeking is done in terms of blocks, not in terms of compressed bits. What that means is that the compression itself is not exposed to user-level programs, so you don't have to think about it or worry about it.

A "super overly simple" way to to visualize: x/0 are blocks, groups of blocks in a file. uncompressed files & blocks: [xxx][xxx][xxx][xxx] compressed files & blocks: [xx]0[xx]0[xx]0[xx]000 In truth it isn't quite like that, but the file inodes will point to compressed blocks and transparently leave out space the file doesn't need.

In principle, there's no current reason not to enable fs-compression. Apart from a few outlier cases, the performance of fs-compression is strictly better than reading uncompressed data. For bioinformatic data, which I've also worked with, you sometimes want to maximize your read bandwidth, and compression will achieve that -- i.e. the uncompressed data read speeds will exceed the controller+interface limits (N compressed bits over SATA III/RAID become N * compression ratio bits). Don't pay attention to any of the nonsense people say about latency, slowing the processor down, etc. The CPU is 1000x faster than disk reading.

For some performance benchmarks, see: http://www.phoronix.com/scan.php?page=article&item=btrfs_lzo_2638&num=2
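If you want to sanity-check the effective read bandwidth on your own array rather than relying on published numbers, a rough sketch like this measures what an application actually sees (the path is a placeholder; run it on a cold cache):

    import os, time

    PATH = "/mnt/array/sample.dat"   # placeholder: a large, compressible file on the array
    CHUNK = 1 << 20                  # 1 MiB sequential reads

    fd = os.open(PATH, os.O_RDONLY)
    total = 0
    start = time.perf_counter()
    while True:
        buf = os.read(fd, CHUNK)
        if not buf:
            break
        total += len(buf)
    elapsed = time.perf_counter() - start
    os.close(fd)

    # With transparent compression this figure can exceed the raw disk/controller
    # bandwidth, because fewer physical bytes are read per logical byte delivered.
    print(f"{total / elapsed / 1e6:.0f} MB/s effective sequential read")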

Another confusion can emerge if we mix up file-level compression (e.g. gzip or xz) with filesystem-level compression. In those cases, yes, file seeking is non-deterministic and absolute data locations in the file are not available without decompressing the preceding byte stream just to locate the dictionary offsets within the file. So with fs-level compression you retain seeking, at the loss of some compressibility.
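The difference is easy to see with Python's gzip module, which has no random-access index: a seek inside a .gz stream has to decompress everything up to the target offset, while a seek into the same data stored as a plain file on a compressed mount does not. A sketch of the comparison (file names are placeholders):

    import gzip, time

    # Placeholder: any multi-GB .gz file produced by ordinary gzip (no seek index).
    with gzip.open("/mnt/array/reads.fastq.gz", "rb") as f:
        start = time.perf_counter()
        f.seek(1_000_000_000)        # gzip has no random access: this decompresses
        f.read(4096)                 # ~1 GB of data just to reach the offset
        print(f"gzip seek+read: {time.perf_counter() - start:.1f} s")

    # The same logical data as a plain file on a compressed btrfs mount:
    with open("/mnt/array/reads.fastq", "rb") as f:
        start = time.perf_counter()
        f.seek(1_000_000_000)        # metadata lookup only; at most the block
        f.read(4096)                 # containing the offset is decompressed
        print(f"fs-level seek+read: {time.perf_counter() - start:.3f} s")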

As an aside, the reason block-level/fs compression is usually (and historically) disabled is that it can increase fragmentation within a file, especially with mid-file writes. For old drives, or drives holding database files, the fragmentation itself can incur a performance penalty (this is still true with SSDs, but because of the rewrite/erase-block cycle, not because of a linearly moving read head). If this is a giant bioinformatic stream, mid-file writes may not be a problem.

In general, seek time scales as a function of the inode and filesystem layout, not file size. E.g. if you have two files, a big one X and a bigger one Y, neither of which fits within the disk readahead and cache, nor can be read in a single inode read, then the time to reach position x in X is approximately equal to the time to reach position y in Y, where x < y. There are cases where it can appear to be different, but those are due to other uncontrolled factors, such as rotational position on the spinning platter. Or files X and Y are being opened and read as streams, in which case all of X up to position x has to be read, and the same for Y -- but that is not a function of the filesystem. An fseek() call directly into different file positions will reveal similar seek times. (Again, dependent on position on the platter.)
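A quick way to convince yourself of this: time a direct seek deep into a large file and a direct seek even deeper into a much larger file, and compare. A sketch using placeholder paths and offsets (cold caches give the most honest numbers):

    import os, time

    def time_seek_read(path, offset, size=4096):
        # Open the file, seek straight to `offset`, read a small chunk, and time it.
        fd = os.open(path, os.O_RDONLY)
        start = time.perf_counter()
        os.lseek(fd, offset, os.SEEK_SET)   # direct seek: nothing before `offset` is read
        os.read(fd, size)
        elapsed = time.perf_counter() - start
        os.close(fd)
        return elapsed

    # Placeholder files: X.dat is large, Y.dat is much larger; y > x.
    print("x in X:", time_seek_read("/mnt/array/X.dat", 800_000_000_000))      # ~0.8 TB in
    print("y in Y:", time_seek_read("/mnt/array/Y.dat", 8_000_000_000_000))    # ~8 TB in

If compression really imposed an O(file_length) seek cost, the second call would be roughly ten times slower; with block-level fs compression the two times come out about the same.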

HTH.