Redundant NFS on Amazon EC2
On AWS, using GlusterFS behind an Elastic Load Balancer with auto-scaled EC2 instances should achieve what you want. I can't comment on any other IaaS.
Amazon does provide some of what you need to achieve your objective - and allows you to implement the rest.
Amazon's EC2 servers are essentially VPSes - you can set up Heartbeat/Corosync/Pacemaker, etc. on them (although, last time I checked, you cannot use broadcast or multicast on their network - unicast, via udpu, works though).
You mention two ideas which Amazon addresses (somewhat) separately: fault tolerance and redundancy.
There is no built-in mechanism for redundancy on EC2, although, depending on what you are looking for, there are some ways to achieve it.
- Theoretically, S3 is designed with multiple layers of redundancy and is "designed to provide 99.999999999% durability of objects over a given year". Their SLA commits to 99.9% availability. If you want to go that route for static files, you can mount an S3 bucket as a local file system using s3fs (a FUSE-based tool). This is fairly slow, however, and not really advisable for most purposes (code, databases, server software, etc.).
- EBS snapshots give you a compressed, differential, point-in-time image of your EBS volume. These are great as a backup - and you can launch new instances from a snapshot - but they are not true redundancy (see the snapshot/restore sketch after this list).
- For actual redundancy, you must set it up yourself. One approach designed for this problem is GlusterFS. You can set up your bricks as distributed, replicated, or both, and data will be spread across the system - it is resilient to the removal of individual nodes, and there is a pre-built AMI that you can launch multiple instances from to build a cluster (a minimal setup sketch follows this list).
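To make the snapshot idea concrete, here is a minimal sketch using boto3 (AWS's Python SDK); the region, volume ID, and availability zone are placeholders you would substitute with your own. It takes a point-in-time snapshot, then restores it to a fresh volume that a replacement instance can attach:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Take a point-in-time snapshot of an existing EBS volume
# (works even while the volume is attached and in use).
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",   # placeholder volume ID
    Description="nightly backup of data volume",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

# Restore: create a fresh volume from the snapshot, in the
# availability zone of the replacement instance.
volume = ec2.create_volume(
    SnapshotId=snapshot["SnapshotId"],
    AvailabilityZone="us-east-1a",
)
print(volume["VolumeId"])
```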
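And as a rough illustration of the GlusterFS approach - a hypothetical two-node replicated volume, driven through the gluster CLI from Python; the hostnames and brick paths are made up and would match your own instances:

```python
import subprocess

def sh(cmd):
    """Run a shell command and fail loudly if it errors."""
    subprocess.run(cmd, shell=True, check=True)

# From the first node: peer the second node, create a 2-way
# replicated volume with one brick per node, then start it.
sh("gluster peer probe gluster2.internal")
sh("gluster volume create shared replica 2 transport tcp "
   "gluster1.internal:/export/brick1 gluster2.internal:/export/brick1")
sh("gluster volume start shared")

# Clients would then mount it with:
#   mount -t glusterfs gluster1.internal:/shared /mnt/shared
```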
Fault tolerance, on the other hand, is better provided for by the Amazon platform:
- The EC2 network offers multiple regions and availability zones - which (theoretically) provide isolated and/or geographically separated data centres to avoid single points of failure.
- Amazon offers monitoring (CloudWatch) of a variety of instance metrics (CPU, network, disk I/O, etc.), as well as custom metrics. These can be used as a trigger for launching new instances from a pre-built AMI - a process called 'auto scaling' (see the custom-metric sketch after this list).
- EC2 has Elastic IP addresses - public IP addresses that can be reserved and quickly remapped to another instance on demand, allowing you to avoid the delays of DNS propagation when an instance goes down (a remapping sketch follows this list).
- Finally, Amazon has Elastic Load Balancers (ELBs) - these are designed to avoid being a single point of failure and to scale with incoming traffic (they do not suffer from the bandwidth limitations that a single instance set up as a load balancer would be subject to). ELBs can monitor the 'health' of the back-end instances and work with auto scaling to maintain an appropriate number of instances.
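As a sketch of the custom-metric side of this: publishing a metric with boto3 looks roughly like the following. The namespace and metric name are hypothetical; an auto scaling policy would then alarm on this metric to add or remove instances:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Publish a custom application metric (e.g. work queue depth);
# a CloudWatch alarm on this metric can drive an auto scaling policy.
cloudwatch.put_metric_data(
    Namespace="MyApp",              # hypothetical namespace
    MetricData=[{
        "MetricName": "QueueDepth", # hypothetical metric
        "Value": 42.0,
        "Unit": "Count",
    }],
)
```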
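And remapping an Elastic IP through the API is a single call - a minimal boto3 sketch, with placeholder allocation and instance IDs:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Remap a reserved Elastic IP to a healthy standby instance.
# AllowReassociation lets the address move even if it is still
# associated with the failed instance.
ec2.associate_address(
    AllocationId="eipalloc-0123456789abcdef0",  # placeholder
    InstanceId="i-0fedcba9876543210",           # placeholder
    AllowReassociation=True,
)
```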
In addition to the above, you can pass custom parameters (user data) to your newly launched instances, and retrieve information about your currently running instances fairly easily - which may allow you to script some of the setup. And, of course, AWS has an API that lets you script all the actions they offer - including remapping an Elastic IP address, launching new instances, and detaching/attaching EBS volumes.
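For instance, the instance metadata service exposes both an instance's own details and any user data passed at launch, over a link-local HTTP endpoint. A small (IMDSv1-style, unauthenticated) Python sketch:

```python
import urllib.request

METADATA = "http://169.254.169.254/latest"

def imds(path):
    """Fetch a value from the EC2 instance metadata service."""
    with urllib.request.urlopen(f"{METADATA}/{path}", timeout=2) as resp:
        return resp.read().decode()

instance_id = imds("meta-data/instance-id")
user_data = imds("user-data")   # raises HTTPError (404) if none was set
print(instance_id, user_data)
```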
You described 'files are kept on a separate, redundant EBS...[which is then] mounted'. Firstly, on EC2, an EBS volume can only be attached to one instance at a time (so to copy data to it, the volume must be attached to that instance). It is up to you to maintain redundancy - you can set up RAID arrays of EBS devices, or do pretty much anything else.

The problem, though, is that EBS volumes are sometimes not detached when an instance actually crashes. You can force-detach them (which has a better, but not perfect, success rate - a sketch follows), and you can snapshot an EBS volume even while it is in use (from which you could then create a new EBS volume and attach it to a newly launched instance). It is better (lower time to recover, more flexible, etc.) to maintain replicas of your data across multiple instances than across multiple EBS volumes on the same instance.
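A hedged sketch of that detach-then-force-detach fallback with boto3, using a placeholder volume ID:

```python
import boto3
from botocore.exceptions import ClientError, WaiterError

ec2 = boto3.client("ec2", region_name="us-east-1")
vol = "vol-0123456789abcdef0"   # placeholder volume ID

try:
    # Ask for a clean detach first and wait for it to complete...
    ec2.detach_volume(VolumeId=vol)
    ec2.get_waiter("volume_available").wait(
        VolumeIds=[vol],
        WaiterConfig={"Delay": 5, "MaxAttempts": 12},
    )
except (ClientError, WaiterError):
    # ...then force it; this risks data loss if writes were in flight.
    ec2.detach_volume(VolumeId=vol, Force=True)
```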