How to store terabytes of large, randomly accessed files?

If you only serve this data locally, you could easily assemble a single server with a couple of terabytes of storage using off-the-shelf components. Teaming up a couple of gigabit NICs could provide the necessary network throughput.

If the content has to be served over larger distances, it might be better to replicate the data across several boxes. If you can afford it, you could fully replicate the data, and if files never get overwritten, crude timestamp-based replication scripts could work (a minimal sketch follows).
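For example, a crude one-way replication pass could look something like this Python sketch. The paths are hypothetical placeholders, and rsync would do the same job more robustly:

```python
# Crude one-way replication: copy any file whose mtime is newer on the
# source than on the mirror. Only sound if files never get overwritten
# or deleted on the source. Paths below are hypothetical.
import os
import shutil

SRC = "/data/files"        # hypothetical master copy
DST = "/mnt/mirror/files"  # hypothetical replica

for root, _dirs, files in os.walk(SRC):
    for name in files:
        src_path = os.path.join(root, name)
        dst_path = os.path.join(DST, os.path.relpath(src_path, SRC))
        # copy when the replica is missing the file or has an older timestamp
        if (not os.path.exists(dst_path)
                or os.path.getmtime(src_path) > os.path.getmtime(dst_path)):
            os.makedirs(os.path.dirname(dst_path), exist_ok=True)
            shutil.copy2(src_path, dst_path)  # copy2 preserves the mtime
```

Run it from cron on each replica; anything fancier (deletions, conflicts, partial transfers) is where you want rsync or a real replication tool instead.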

Otherwise you could look at parallel filesystem implementations; free options include Lustre (for Linux) and Hadoop's HDFS (multi-platform).


All of these are significant:

1) lots of RAM

2) multiple network cards and/or frontends to reduce bottlenecks

3) reverse proxy server, such as Squid (see e.g. http://www.visolve.com/squid/whitepapers/reverseproxy.php ) or Varnish (a toy sketch of the idea follows this list)

4) RAID setup for disks (striped, or possibly a stripe/mirror combination, i.e. RAID 10)

5) choice of the correct filesystem and, yes, block size. XFS used to be a good performer for large amounts of data; nowadays ZFS is probably better.
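To illustrate point 3: a caching reverse proxy answers repeat requests from a local cache instead of forwarding every one to the origin file server. Squid and Varnish do this properly (disk cache, eviction, range requests); the Python sketch below is only a conceptual illustration, and the origin URL and ports are made up:

```python
# Toy caching reverse proxy: serve repeat GETs from an in-memory cache.
# ORIGIN and the listening port are hypothetical placeholders.
import http.server
import urllib.error
import urllib.request

ORIGIN = "http://127.0.0.1:8080"  # hypothetical back-end file server
CACHE = {}  # path -> (status, content_type, body); real proxies cache on disk with eviction

class CachingProxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path not in CACHE:
            # cache miss: fetch from the origin and remember the response
            try:
                with urllib.request.urlopen(ORIGIN + self.path) as resp:
                    ctype = resp.getheader("Content-Type", "application/octet-stream")
                    CACHE[self.path] = (resp.status, ctype, resp.read())
            except urllib.error.HTTPError as err:
                self.send_error(err.code)
                return
        status, ctype, body = CACHE[self.path]
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    http.server.HTTPServer(("", 8000), CachingProxy).serve_forever()
```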

All of these should help. How much of this, and which parts, need to be implemented you should be able to calculate from your target requirements (i.e. the total network bandwidth you want to utilize, the throughput of a single card, the maximum throughput of your disks unraided and raided, etc.); a rough example calculation is sketched below.
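As a back-of-envelope illustration of that calculation, here's a small Python sketch. Every number in it is a placeholder assumption; substitute your own measurements:

```python
# Rough capacity sizing; all figures below are placeholder assumptions.
target_gbit = 4.0      # total outbound bandwidth to sustain, Gbit/s
nic_gbit = 1.0         # throughput of a single gigabit NIC
disk_mb_s = 120.0      # sequential read of one disk, MB/s (unraided)
stripe_disks = 4       # data disks striped together in one RAID set

target_mb_s = target_gbit * 1000 / 8       # Gbit/s -> MB/s (decimal units)
raided_mb_s = disk_mb_s * stripe_disks     # idealized striped read rate;
                                           # random access will be lower

print(f"NICs (or frontends) needed: {target_gbit / nic_gbit:.1f}")
print(f"Target throughput: {target_mb_s:.0f} MB/s")
print(f"One RAID set delivers roughly {raided_mb_s:.0f} MB/s")
print(f"RAID sets needed: {target_mb_s / raided_mb_s:.1f}")
```

With these example numbers you'd need about four gigabit NICs (or frontends) and roughly two striped RAID sets to keep up, before accounting for caching, which is exactly what items 1 and 3 above are meant to buy you.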