What design features make Joyent's ZFS and Amazon's EBS (S3) reliable?

I know this isn't exactly an apples-to-apples comparison; what I'm trying to evaluate is which one is safer, i.e. less likely to lose data.

Joyent's SmartOS uses ZFS to store data whereas an EC2 machine can use Amazon Elastic Block Store (EBS) which stores its data on S3.

What are some of the architectural details that make the two systems reliable? I'm not sure whether S3 is designed to store its data at more than one location.

As you said, this isn't exactly an apples-to-apples comparison (and there is already agreement that decent data backup procedures must be in place for both, so I'm not going to address that). The question therefore cannot be answered as posed; rather, one should be aware of the architectural details of each offering and weigh them against the particular use case at hand.

In particular, the ZFS based storage system from Joyent is a local storage system designed to deliver carrier-grade storage and data reliability, see Data Resiliency and Reliability:

We put ZFS on top of a high performance local storage subsystem to ensure that your data is safe, consistent, and always accessible and recoverable. ZFS is a combined file system and logical volume manager designed for pooled local storage. Unlike other file systems deployed for cloud storage, ZFS’ copy-on-write capability guarantees that your image will not be lost. [emphasis mine]
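To make the copy-on-write idea concrete, here is a deliberately tiny sketch in Python. It is not how ZFS is implemented; the class `CowStore` and its methods are hypothetical names invented for illustration. It only shows the general principle: new data goes to a fresh block, and the live view changes with a single pointer swap, so an interrupted write leaves the previous version intact.

```python
class CowStore:
    """Toy copy-on-write store (hypothetical, for illustration only)."""

    def __init__(self):
        self._blocks = {}   # block id -> bytes; existing blocks are never overwritten
        self._next_id = 0
        self.root = {}      # live view: name -> block id

    def _alloc(self, data: bytes) -> int:
        # Always allocate a fresh block instead of modifying an old one.
        block_id = self._next_id
        self._next_id += 1
        self._blocks[block_id] = data
        return block_id

    def write(self, name: str, data: bytes, crash_before_commit: bool = False) -> None:
        new_id = self._alloc(data)      # step 1: new data lands in a new block
        if crash_before_commit:
            return                      # simulated crash: the root is never updated
        new_root = dict(self.root)      # step 2: build a new root, old one untouched
        new_root[name] = new_id
        self.root = new_root            # step 3: commit with one pointer swap

    def read(self, name: str) -> bytes:
        return self._blocks[self.root[name]]

    def snapshot(self) -> dict:
        # Roots are never mutated after commit, so keeping a reference is enough.
        return self.root


store = CowStore()
store.write("image", b"v1")
snap = store.snapshot()
store.write("image", b"v2", crash_before_commit=True)
after_crash = store.read("image")       # still the old, consistent version
store.write("image", b"v2")
after_commit = store.read("image")
old_version = store._blocks[snap["image"]]
```

In this toy model the snapshot costs nothing beyond holding on to the old root, which is the same reason copy-on-write file systems can offer cheap snapshots.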

In contrast, EBS is a network block storage system designed to provide highly available, highly reliable storage volumes that can be attached to a running Amazon EC2 instance and exposed as a device within the instance, see section Features of Amazon EBS volumes within Amazon Elastic Block Store (EBS) for details, e.g.:

Amazon EBS volumes are placed in a specific Availability Zone, and can then be attached to instances also in that same Availability Zone.

Each storage volume is automatically replicated within the same Availability Zone. This prevents data loss due to failure of any single hardware component.

Amazon EBS also provides the ability to create point-in-time snapshots of volumes, which are persisted to Amazon S3. These snapshots can be used as the starting point for new Amazon EBS volumes, and protect data for long-term durability. [...]

[emphasis mine]

The last point highlights that EBS does not itself store its data on S3; rather, it provides an easy-to-use backup mechanism for long-term durability via S3. This implies that you need to assess the two scenarios (live EBS volumes and snapshots persisted to S3) separately in terms of durability and availability.

Section Amazon EBS Volume Durability further details this architecture:

[...] Amazon EBS volume data is replicated across multiple servers in an Availability Zone to prevent the loss of data from the failure of any single component. The durability of your volume depends both on the size of your volume and the percentage of the data that has changed since your last snapshot. [...]

Because Amazon EBS servers are replicated within a single Availability Zone, mirroring data across multiple Amazon EBS volumes in the same Availability Zone will not significantly improve volume durability. However, for those interested in even more durability, Amazon EBS provides the ability to create point-in-time consistent snapshots of your volumes that are then stored in Amazon S3, and automatically replicated across multiple Availability Zones. [...]

[emphasis mine]

So while EBS stores data on multiple servers in one availability zone only, S3 provides the extra mile of physical infrastructure separation, see How isolated are Availability Zones from one another?:

Each availability zone runs on its own physically distinct, independent infrastructure [...]. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone. [emphasis mine]

This yields a claimed durability of 99.999999999% as outlined in How durable is Amazon S3? and further detailed in How is Amazon S3 designed to achieve 99.999999999% durability?:

Amazon S3 redundantly stores your objects on multiple devices across multiple facilities in an Amazon S3 Region. [...] When processing a request to store data, the service will redundantly store your object across multiple facilities before returning SUCCESS. [...] [emphasis mine]
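To put the claimed figure into perspective, the arithmetic can be sketched as follows. This assumes the 99.999999999% is read as a per-object durability over one year and that object losses are independent; both are simplifying assumptions of this sketch, and the function names are made up for illustration. `Fraction` is used to avoid floating-point rounding on such a small loss rate.

```python
from fractions import Fraction

# Claimed durability, interpreted here as per object and per year (assumption).
DURABILITY = Fraction("99.999999999") / 100
ANNUAL_LOSS_RATE = 1 - DURABILITY          # exactly 1 / 10**11


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year under the assumptions above."""
    return object_count * ANNUAL_LOSS_RATE


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years until one object is lost."""
    return 1 / expected_annual_losses(object_count)


losses = expected_annual_losses(10_000_000)
years = years_per_lost_object(10_000_000)
print(f"10,000,000 objects: {float(losses)} expected losses/year, "
      f"i.e. one object every {int(years)} years on average")
```

Under these assumptions, storing ten million objects yields an expected loss of one object roughly every 10,000 years. This is a design target for durability only; it says nothing about availability during an outage.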

Please note that an availability zone is still constrained to a single region (see Using Regions and Availability Zones for details on this architecture), and there have already been related incidents, triggering discussions about whether region and/or provider redundancy is the way to go for utmost reliability (see Outages below).

Finally, section Amazon S3 Data Consistency Model in Amazon S3 Concepts provides more detail on how Amazon S3 achieves high availability by replicating data across multiple servers within Amazon's data centers.

Outages

Both services have had at least one major outage in the past. The respective post-mortem analyses provide additional insight into the design of each system and allow you to account for this in your backup and availability strategies:

Joyent - Further Strongspace and BingoDisk Update
Amazon - Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region
- section Overview of EBS System features an insightful summary of the EBS architecture

The latter outage sparked quite some discussion regarding the reliability of cloud computing in general. Interestingly, it also triggered the article Magical Block Store: When Abstractions Fail Us on Joyent's blog, which explores the differences between both approaches and explains Joyent's architectural choices (including self-criticism of earlier failed attempts). While this article might obviously be considered biased, it should still allow you to draw your own conclusions.

You don't have the data unless you have it in triplicate at two geographically different locations.

Depending on a single RAID instance, a single virtual block device, a single supplier, etc. to reliably store your data is careless at best.
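As a rough illustration of that rule of thumb, the check below tests whether a set of copies meets "at least three copies in at least two locations". The `Copy` type, the holder labels and the location names are all hypothetical; what counts as a geographically different location is for you to define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    holder: str     # hypothetical label, e.g. a provider or a local array
    location: str   # hypothetical label for a geographic site


def satisfies_rule(copies, min_copies=3, min_locations=2):
    """True if there are enough copies spread over enough distinct locations."""
    locations = {c.location for c in copies}
    return len(copies) >= min_copies and len(locations) >= min_locations


single_site = [Copy("raid-a", "site-1"), Copy("raid-b", "site-1"), Copy("tape", "site-1")]
two_sites = [Copy("raid-a", "site-1"), Copy("tape", "site-1"), Copy("remote", "site-2")]

print(satisfies_rule(single_site))  # three copies, but only one location
print(satisfies_rule(two_sites))    # three copies across two locations
```

Three replicas inside one site or one availability zone fail this check, which is exactly the distinction the rule is meant to capture.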

That being said, unless something has changed in the 2-3 years since I last checked, Amazon doesn't give any guarantee that S3 data will be there the next time you look. They have been reliable during the past few years as far as storage is concerned, so it's not like the data regularly disappears.
