Seeking distributed, fault-tolerant, networked block storage [closed]
I'm looking for a distributed, fault-tolerant network storage system which exposes block devices (not filesystems) on the clients.
- A client's block device should write simultaneously to several storage nodes
- A client's block device should not fail as long as not all storage nodes backing it went down
- The master should automatically redistribute the storage nodes' data when a storage node fails or gets added or removed
- A single master (which is for metadata only) is fine
So ideally the architecture would be very similar to MooseFS (http://www.moosefs.org/), but instead of exposing a real filesystem mounted via a FUSE client, it would expose block devices on the clients.
I know of iSCSI and DRBD, but neither seems to offer what I'm looking for. Or am I missing something?
Based on the above requirements, Ceph may be what you're after. http://ceph.newdream.net/
Ceph provides a distributed, POSIX-compliant file system, and it also exposes network block devices through RBD (the RADOS Block Device). The RBD client is implemented directly in modern Linux kernels (2.6.37+).
There's even a Qemu/KVM storage driver, which means you can use RBD images directly as virtual machine disks.
Major web-hosting company DreamHost (http://dreamhost.com/) relies on Ceph.
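If you go the Ceph route and want to script image management (rather than only using the kernel client), the librbd/librados Python bindings cover the basics. A minimal sketch, assuming the python-rados and python-rbd packages, a reachable cluster, and a pool called 'rbd' (the image name 'block0' is just an example):

    import rados
    import rbd

    # Connect to the cluster using the local ceph.conf; the pool name 'rbd'
    # and image name 'block0' are examples, not anything your cluster must have.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')
    try:
        # Create a 4 GiB image, then write one block into it through librbd.
        rbd.RBD().create(ioctx, 'block0', 4 * 1024 ** 3)
        image = rbd.Image(ioctx, 'block0')
        try:
            image.write(b'\x00' * 4096, 0)   # (data, offset)
        finally:
            image.close()
    finally:
        ioctx.close()
        cluster.shutdown()

Once the image exists, "rbd map block0" on a client with the rbd kernel module (2.6.37+ as noted above) exposes it as a regular /dev/rbd* block device.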
First thing I would say is that you may need to reset your expectations of complexity. The subject of your question alone includes:
- distributed
- fault-tolerant
- network block device
Each of these alone is generally a topic of at least moderate complexity. Combine all three of them, and you're not going to accomplish it without a bit of work.
I think what you're missing is something that can feasibly accomplish all of your requirements and still be simple or easy. Some of your requirements are very difficult to implement together, if not outright contradictory. Individually, each can be accomplished without too much difficulty, but putting them all together is where it gets tricky.
I'm going to run through each of the requirements and provide comments:
A client's block device should write simultaneously to several storage nodes
This can be accomplished by using redundant storage under the hood. The redundancy could be implemented at the "storage node" level, using redundant local storage (RAID and such), or at the network level by duplicating the data to multiple nodes.
A client's block device should not fail as long as not all storage nodes backing it went down
Along with the previous, this is easily accomplished with redundancy in the storage. This part would require that the storage be implemented in a "network RAID1" type setup.
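To make the "network RAID1" idea concrete, here's a toy sketch of the client-side behaviour: every block write is mirrored to all storage nodes, and the write only fails once every node is unreachable. The node objects are stand-ins invented for the example; a real setup would be talking to iSCSI/AoE/NBD targets rather than Python objects.

    class StorageNode:
        """Stand-in for one remote storage node (really an iSCSI/AoE/NBD target)."""
        def __init__(self, name):
            self.name = name
            self.blocks = {}   # block number -> data
            self.alive = True

        def write_block(self, block_no, data):
            if not self.alive:
                raise IOError("node %s is down" % self.name)
            self.blocks[block_no] = data

    def replicated_write(nodes, block_no, data):
        """Mirror one block to every node; fail only if no node accepted it."""
        ok = 0
        for node in nodes:
            try:
                node.write_block(block_no, data)
                ok += 1
            except IOError:
                pass   # a real driver would mark the node dirty and resync later
        if ok == 0:
            raise IOError("all %d storage nodes are down" % len(nodes))
        return ok

    nodes = [StorageNode("a"), StorageNode("b"), StorageNode("c")]
    nodes[1].alive = False                       # one node fails...
    replicated_write(nodes, 0, b"\x00" * 4096)   # ...the write still succeeds

The hard parts, as discussed below, are resynchronising the node that missed writes and keeping all copies consistent, not the mirroring itself.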
The master should automatically redistribute the storage nodes' data when a storage node fails or gets added or removed
Here is where things get difficult. You specifically stated that you want a block device exported. That makes this feature much more difficult on the back end, unless you are replicating the entire block device. With a block device, the server-side functionality can't look at a file and duplicate the blocks that make up that file, the way it could if it presented a filesystem interface. That leaves the server side either treating the entire block device as a whole and replicating every block in its entirety to a single separate location, or implementing a lot of quirky intelligence to get good reliability, consistency, and performance. Very few systems implement something like this right now.
A single master (which is for metadata only) is fine
As a concept, this works much better when you're dealing with file chunks from a filesystem than it does with block devices. Most of the systems that implement something like this do so with either a filesystem interface or a pseudo-filesystem interface.
Generally, you're making a decision. Either you get your remote storage as a filesystem, in which case you're accessing a high-level interface and letting the storage side make the decisions and handle the low-level details for you, or you get the storage as a block device, in which case you take responsibility for those features, or at least most of them. You're getting your storage at a lower level, and that leaves more work for you to implement those low-level features (distribution, fault tolerance, etc.) yourself.
Also, you need to remember that as a general rule, fault tolerance and high performance are opposite ends of the same spectrum for a given set of hardware. As you increase the redundancy, you decrease the performance. The simplest example: if you have 4 disks, you can stripe all 4 of them in a RAID0 for maximum performance, or you can duplicate the same data 4 times across all 4 disks. The former will give you maximum performance, the latter maximum redundancy. In between there are various trade-offs, such as a 4-disk RAID5 or, my personal preference, a 4-disk RAID10.
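For a rough feel of that trade-off, here's a back-of-the-envelope comparison assuming four 1 TB disks (the sizes are just an example, the figures ignore real-world overhead, and RAID10 is only guaranteed to survive one failure, though it can survive two if they land in different mirror pairs):

    disks, size_tb = 4, 1.0
    layouts = {
        # layout: (usable capacity in TB, guaranteed disk failures survived)
        "RAID0  (stripe)":            (disks * size_tb,         0),
        "RAID5  (single parity)":     ((disks - 1) * size_tb,   1),
        "RAID10 (stripe of mirrors)": (disks / 2.0 * size_tb,   1),
        "RAID1  (4-way mirror)":      (1 * size_tb,             3),
    }
    for name, (usable, failures) in layouts.items():
        print("%-28s usable %.1f TB, survives %d failed disk(s)" % (name, usable, failures))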
If I were putting together something meeting your requirements, I'd probably export all the disks with iSCSI or ATA Over Ethernet (AoE), and use MD software RAID or LVM mirroring (or a combination of the two) to get the level of redundancy I needed.
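As a very rough sketch of what that looks like in practice (the portal addresses and /dev/sdX names below are placeholders, and you'd run this as root on a host with open-iscsi and mdadm installed), the client side boils down to logging in to each storage node's exported disk and assembling them with MD:

    import subprocess

    portals = ["192.168.0.11", "192.168.0.12", "192.168.0.13", "192.168.0.14"]

    def run(cmd):
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)

    # Discover and log in to the LUN each storage node exports over iSCSI.
    for portal in portals:
        run(["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal])
        run(["iscsiadm", "-m", "node", "-p", portal, "--login"])

    # Mirror/stripe the four imported disks; /dev/md0 is the block device
    # the client actually uses, and it survives a failed storage node.
    run(["mdadm", "--create", "/dev/md0", "--level=10", "--raid-devices=4",
         "/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"])

LVM mirroring over the same imported disks is the other route mentioned above; either way, multipathing, monitoring, and resync policy are still on you.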
Yeah, there's some manual work to set up and maintain it, but it gives you precise control over things to reach the level of fault tolerance and performance required. DRBD is another option that could fit into it, but if you're going to deal with more than a couple of "storage nodes", I'd probably pass on it.
Update: The above assumes you're wanting to build your own solution. If you've got a big enough budget, you can buy a SAN/NAS solution that, while it probably won't be exactly as you describe above, could be treated as a black box with roughly the same functionality.
You are describing a SAN. If you want to build it yourself, you probably can, but I can't help you more than pointing you in the direction of ZFS. If you end up buying one from a storage vendor, you'll want to change the way you describe it. Here is a breakdown of what you're asking for:
- "A client's block device should write simultaneously to several storage nodes": this equates to multiple controllers in an active/active multipath environment. Each write will only be sent to a single node, however multiple writes will tend to take multiple paths if you configure the local multipath driver properly.
- "A client's block device should not fail as long as not all storage nodes backing it went down": This equates to not having a single point of failure. Each node has to be capable of handling the traffic of the entire infrastructure, and there should be two distinct networks for sending IO to the box that share no points of failure. If you go with a fibre channel storage device, this will mean having two switches and not linking them to each other.
- "The master should automatically redistribute storages' data when a storage node fails or gets added/ removed": This equates to two things. First, drive failure recovery. If a drive fails, the storage should re-create the data from parity or copies (depending on whether the storage uses RAID or something like it) and replace the lost disk's contents on good disks. Second, it also refers to controller failure. If a controller fails, the hosts should be able to continue as if nothing happened, and all in-flight IO should be treated without failures. This is accomplished through cache mirroring, or ensuring that a write is not acknowledged until it has been safely saved to more than one cache.
I'd add to your list if I knew more about your environment, but this will be able to get you started.