Linux file system for a big file server

A lot of people are suggesting ZFS, but ZFS is not available natively under Linux except through FUSE. I wouldn't recommend that for your situation, where performance is likely to be important.

Unfortunately, ZFS will never be available as a native kernel module unless the licensing issues are sorted out somehow.

XFS is good, but some people have reported corruption issues, and I can't really comment on that. I've played with small XFS partitions without hitting those problems, but not in production.

ZFS still has too many advantages and useful features to ignore, though. In summary (see the ZFS Wiki for a full description of what each one means):

  • Data integrity
  • Storage pools
  • L2ARC
  • High capacity
  • Copy on write
  • Snapshots & clones
  • Dynamic striping
  • Variable block sizes
  • Lightweight filesystem creation
  • Cache management
  • Adaptive endianness
  • Deduplication
  • Encryption

So how do we get around it? My suggested alternative, which may suit your situation, is to consider Nexenta. This is an OpenSolaris kernel with GNU userland tools running on top. Having an OpenSolaris kernel means having ZFS available natively.
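To give a feel for how little ceremony the pool and filesystem management involves, here's a minimal sketch; the pool name tank, the raidz2 layout, and the disk device names are all placeholders for whatever your hardware actually exposes:

    # create a double-parity pool across six disks (device names are examples)
    zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0

    # filesystems are cheap, so create one per share and enable compression on it
    zfs create tank/media
    zfs set compression=on tank/media

    # take an instant, space-efficient snapshot before any risky change
    zfs snapshot tank/media@before-cleanup

    # check pool health and space usage
    zpool status tank
    zfs list

On Nexenta the same zpool/zfs commands apply, since it's the same ZFS code underneath.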


You should give XFS a try; it fits your requirements well:

XFS is a 64-bit file system. It supports a maximum file system size of 8 exbibytes minus one byte, though this is subject to block limits imposed by the host operating system. On 32-bit Linux systems, this limits the file and file system sizes to 16 tebibytes.
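If you go this way, creating and mounting it takes one command each; a minimal sketch, with /dev/md0 and the mount point standing in for your actual array and layout:

    # build the filesystem; mkfs.xfs picks sane defaults from the underlying device
    mkfs.xfs -L bigdata /dev/md0

    # mount it and confirm the geometry it chose
    mount /dev/md0 /srv/data
    xfs_info /srv/data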


Your easiest option is to use XFS. A lot of the bad experiences around XFS are based on old versions and desktop hardware problems that I don't think are really relevant for new deployments onto standard-quality server hardware. I wrote a blog post about this subject that may help you sort out the current situation.

I help manage multiple busy XFS database installations with hundreds of users and terabytes of data. They're all on the Debian Lenny kernel (2.6.26) or later, and I haven't heard a hint of trouble with them in years; I wouldn't use XFS with any earlier kernel than that. I have heard some direct reports of people still seeing strange XFS behavior when the system runs out of memory or disk space, though I haven't seen that myself yet.

The only other reasonable option is to use ext4 with some hacking to support larger filesystems. I wouldn't expect that to have a very different reliability level. I've had to recover data from multiple broken ext4 systems that ran into kernel bugs; so far they were all fixed upstream but not yet in the distributor's kernel at the time. ext4 has its own set of metadata issues, such as delayed allocation data loss, that were less likely to happen on ext3. I would estimate the odds of hitting an ext4 bug are even higher than normal if you're forcing it over the normal size limit, simply because you're more likely to exercise a less well tested new code path at some point.
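For context on the delayed allocation point: applications that rewrite files in place are the ones exposed, and the usual defence is to write to a temporary name, flush, and then rename. A minimal sketch of that pattern, with hypothetical file names:

    # write the replacement under a temporary name so the live file is never half-written
    cat new-settings > app.conf.tmp

    # force dirty data out to disk before making the new version visible
    sync

    # rename is atomic: after a crash you get either the old file or the complete new one
    mv app.conf.tmp app.conf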

An alternative idea is to just use the safer and more boring ext3, accept the 16TB limit, and partition things better so that no single filesystem has to be that large.
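If you go that route, LVM takes the pain out of the partitioning part; a minimal sketch, assuming the array shows up as /dev/md0 and that 10TB volumes suit your layout (names and sizes are placeholders):

    # put the whole array under LVM so volumes can be grown later
    pvcreate /dev/md0
    vgcreate storage /dev/md0

    # carve out volumes that each stay comfortably under the ext3 limit
    lvcreate -L 10T -n media storage
    lvcreate -L 10T -n backups storage

    # one ext3 filesystem per volume
    mkfs.ext3 /dev/storage/media
    mkfs.ext3 /dev/storage/backups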

One loose end, related to journaling issues: you didn't say how all these drives are going to be connected. Make sure you understand the implications of any write caching in your storage chain here. Either disable it or make sure the filesystem is flushing the cache out. I've stashed some resources about that at Reliable Writes if that's not something you're checking yet.
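In practice that comes down to either turning off the drives' volatile write cache or confirming the filesystem issues cache flushes (barriers); a minimal sketch, with /dev/sdX standing in for each drive and /srv/data for the mount point:

    # option 1: disable the on-drive write cache outright
    hdparm -W 0 /dev/sdX

    # option 2: keep the cache but make sure flushes happen; ext3/ext4 take an
    # explicit barrier mount option (XFS enables barriers by default)
    mount -o remount,barrier=1 /srv/data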

Drives suck. RAID arrays suck. Filesystems suck. Multiple failures happen. I'm glad to see you're already thinking about backups; going from good to great reliability on storage requires more than just RAID and some spare drives. Redundancy costs something at every level, and the trade-off between spending money on hardware and taking on software complexity is a tricky one to navigate.

And watch your performance expectations. While a RAID array like you're considering will easily do hundreds of MB/s, all it takes is two concurrent readers seeking the disks around constantly to drop that to only a few MB/s instead. I can easily crush a 24-disk RAID10 array such that it delivers <5MB/s against a benchmark workload. One thing that helps there is to make sure you tweak readahead upward if multiple streaming readers are possible.
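Readahead is a per-device setting; a minimal sketch of checking and raising it, with /dev/md0 standing in for the array device (the value is in 512-byte sectors, so 16384 is 8 MiB):

    # show the current readahead setting
    blockdev --getra /dev/md0

    # raise it so streaming readers get larger chunks back per seek
    blockdev --setra 16384 /dev/md0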