Tips on efficiently storing 25TB+ worth of millions of files in a filesystem
Solution 1:
I'm not a distributed file system ninja, but after consolidating as many drives as I can into as few machines as I can, I would try using iSCSI to connect the bulk of the machines to one main machine. There I could consolidate everything into, hopefully, fault-tolerant storage: preferably fault tolerant within a machine (if a drive goes out) and across machines (if a whole machine is powered off).
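For what it's worth, the plumbing for that could look roughly like this on Linux (just a sketch, assuming targetcli/LIO on the storage boxes and open-iscsi on the main machine; the device names, IQNs, and IP are made up, and you'd still need to sort out ACLs and portals):

    # on each storage box: export a whole disk as an iSCSI LUN (targetcli / LIO)
    targetcli /backstores/block create name=disk0 dev=/dev/sdb
    targetcli /iscsi create iqn.2010-01.com.example:box1
    targetcli /iscsi/iqn.2010-01.com.example:box1/tpg1/luns create /backstores/block/disk0

    # on the main machine: discover and log in to each box's targets (open-iscsi)
    iscsiadm -m discovery -t sendtargets -p 192.168.1.11
    iscsiadm -m node -T iqn.2010-01.com.example:box1 -p 192.168.1.11 --login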
Personally, I like ZFS. In this case, the built-in compression, dedupe, and fault tolerance would be helpful. However, I'm sure there are many other ways to compress the data while keeping it fault tolerant.
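Once a pool exists, the compression and dedupe parts are just properties you switch on (a minimal sketch, assuming a pool named tank; dedupe in particular wants a lot of RAM, so test it against your data first):

    zfs set compression=on tank   # transparent compression for everything in the pool
    zfs set dedup=on tank         # the dedup table lives in RAM, so size the box accordingly
    zpool status tank             # check pool health / redundancy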
I wish I had a real turnkey distributed file solution to recommend. I know this is really kludgey, but I hope it points you in the right direction.
Edit: I am still new to ZFS and setting up iSCSI, but I recall seeing a video from Sun in Germany where they demonstrated the fault tolerance of ZFS. They connected three USB hubs to a computer and put four flash drives in each hub. Then, to prevent any one hub from taking the storage pool down, they made a RAIDZ volume consisting of one flash drive from each hub, and striped the four ZFS RAIDZ volumes together. That way only four flash drives were used for parity. Next, of course, they unplugged one hub, which degraded every RAIDZ volume, but all the data was still available. In this configuration up to four drives could be lost, but only as long as no two of them were in the same RAIDZ volume.
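Translated into commands, the layout from the video would be something like this: one pool made of four RAIDZ volumes, each taking one drive per hub/controller (a sketch; the hubXdY names are placeholders for whatever device names your system actually assigns):

    # four RAIDZ vdevs, each built from one drive on each of the three hubs
    zpool create tank \
        raidz hub0d0 hub1d0 hub2d0 \
        raidz hub0d1 hub1d1 hub2d1 \
        raidz hub0d2 hub1d2 hub2d2 \
        raidz hub0d3 hub1d3 hub2d3

Pull one hub and every RAIDZ volume loses one disk and runs degraded, but the pool as a whole stays online.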
If this configuration were used with the raw drives of each box, it would preserve more drives for data rather than parity. I heard FreeNAS can (or was going to be able to) share drives in a "raw" manner via iSCSI, so I presume Linux can do the same. As I said, I'm still learning, but this alternate approach would be less wasteful, from a drive-parity standpoint, than my previous suggestion. Of course, it would rely on using ZFS, which I don't know would be acceptable. I know it is usually best to stick to what you know if you are going to have to build/maintain/repair something, unless this is a learning experience.
Hope this is better.
Edit: Did some digging and found the video I spoke about. The part where they explain spreading the USB flash drives across the hubs starts at 2m10s. The video demos their storage server "Thumper" (X4500) and how to spread the disks across controllers so that if you have a disk-controller failure your data will still be good. (Personally, I think this is just a video of geeks having fun. I wish I had a Thumper box myself, but my wife wouldn't like me running a pallet jack through the house. :D That is one big box.)
Edit: I remembered coming across a distributed file system called OpenAFS. I haven't tried it; I have only read a bit about it. Perhaps others know how it handles in the real world.
Solution 2:
First, log files can be compressed at really high ratios. I find my log files compress at a 10:1 ratio. If they compress to even a 5:1 ratio, that's only 5TB, or 20% of your storage capacity. The sketch after the list below is a quick way to check the ratio on a sample of your own logs.
Given that you have more than enough storage, the specific compression algorithm isn't too important. You could...
- Use zip files if Windows users will be accessing the files directly.
- Use gzip if they'll be accessed through Linux and quick decompression is important.
- Use bzip2 if they'll be accessed through Linux and it's important to have the smallest possible files.
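Here is the quick ratio check I mentioned (a sketch; sample.log stands in for one of your real log files, and -9 trades CPU time for the best ratio each tool offers):

    ls -l sample.log                                      # original size in bytes
    gzip  -9 -c sample.log | wc -c                        # gzip'd size
    bzip2 -9 -c sample.log | wc -c                        # bzip2'd size
    zip -9 -q sample.zip sample.log && ls -l sample.zip   # zip'd size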
The bigger question is: how are you going to provide your users with easy access to these files? Part of this depends on how your machines are configured.
If you can put enough storage into a single machine, then you can do something extremely simple, like a read-only Windows file share. Just organize the files in subdirectories, and you're ready to go.
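If that single machine ends up being a Linux box rather than a Windows server, the same read-only share is only a few lines of Samba config (a sketch; the share name and path are placeholders):

    # /etc/samba/smb.conf
    [logs]
        path = /srv/logarchive
        read only = yes
        browseable = yes
        guest ok = yes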
If you can't create a single file server for these files, then you may find that you need a distributed filesystem. Windows has a Distributed File System (DFS) which might suit your needs.
If your needs are more advanced, you may want a web application as a front-end where your users can browse and download log files. In this case, I recommend using MogileFS, which is a distributed file system designed to be used with a front-end application server. It's very easy to integrate with most web programming languages. You can't mount it as a shared drive on your computer, but it's top-notch as a data store for a web application.
Solution 3:
lessfs is a deduplicating, compressing file system. While it won't solve the whole problem, it may be worth a look as a backend.
Solution 4:
- export these folders via NFS
- mount them on a single machine running Apache, as a tree under the document root
- use zip to compress them: good compression ratio, and zip files can be opened from any OS
- list the files in Apache, so you are giving users read-only access (log files are not supposed to be edited, right?)
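A rough sketch of those steps (the hostnames and paths are made up; the Apache part just enables directory listings so the tree is browsable read-only):

    # on each log box: export the log folder read-only (/etc/exports)
    /var/logarchive  webhost(ro,no_subtree_check)

    # on the web box: mount each export as a subdirectory under the document root
    mount -t nfs logbox1:/var/logarchive /var/www/html/logs/logbox1

    # Apache (inside the relevant vhost): allow directory listings
    <Directory "/var/www/html/logs">
        Options +Indexes
        Require all granted
    </Directory>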