Cloud regions apparently fail now and then. How resilient should one expect a cloud region to be against fire or natural disasters?
Today the SBG2 availability zone in the SBG region burned down. All other AZs at the site (SBG1, 3 and 4) were shut down as a result, and will remain shut down until at least Monday.
All SBG zones are apparently located at the same physical site, but in different "buildings".
A common pattern when designing high-availability infrastructure is to spread it across multiple availability zones. However, as this incident shows, that pattern can fail too, which makes me think that a solid infrastructure should involve at least two regions.
Is it common to locate the availability zones of a region as close together as OVH did, or do other providers, such as GCE, place AZs within a region farther apart to protect against events like fire?
In general, a region or an availability zone is a logical separation of cloud resources that is made visible to you as the end user.
The theory is that the cloud provider maps its regions and AZs to failure domains in its physical infrastructure. Such a failure domain can be as small as a single server rack; a room, quadrant, or floor in a datacenter with many server racks; or as large as a complete datacenter.
An outage that is limited to a single failure domain won’t (or rather shouldn’t) impact anything running outside of that failure domain.
For example, a short circuit that trips the fuses may take down all systems in a single rack, but nothing beyond that rack will be impacted.
So when you, as the customer, spread your virtual infrastructure over multiple AZs in a single region, you can be confident that an outage will not impact all of your virtual infrastructure at the same time, as long as the outage is confined to a single AZ and does not affect multiple AZs or the complete region.
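As an illustration, here is a minimal sketch of that pattern using Python and boto3; the region, zone names, AMI ID, and instance type are placeholders, not a recommendation:

    import boto3

    # Placeholder values; substitute your own region, zones, and AMI.
    REGION = "eu-west-1"
    ZONES = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
    AMI_ID = "ami-0123456789abcdef0"

    ec2 = boto3.client("ec2", region_name=REGION)

    # One instance per availability zone: an outage confined to a single
    # AZ leaves the instances in the other zones running.
    for zone in ZONES:
        ec2.run_instances(
            ImageId=AMI_ID,
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )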
Most cloud providers don’t publish exactly how their AZs correspond to physical failure domains, so you can’t tell which kinds of outages will transcend the borders of a failure domain and impact multiple (if not all) AZs in a region, or the complete region.
Google Compute Engine says:
Compute Engine resources are hosted in multiple locations worldwide. These locations are composed of regions and zones. A region is a specific geographical location where you can host your resources. Regions have three or more zones. For example, the us-west1 region denotes a region on the west coast of the United States that has three zones: us-west1-a, us-west1-b, and us-west1-c.
To the discerning reader, that suggests there are potential scenarios in which a complete region, and all AZs therein, becomes unavailable, similar to what happened at OVH.
Regions are generally not physically close to each other, as explained here: https://cloud.google.com/about/locations
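If you want to see this region/zone structure for yourself, here is a minimal sketch using the google-cloud-compute Python client (the project ID is a placeholder, and application-default credentials are assumed):

    from collections import defaultdict

    from google.cloud import compute_v1

    PROJECT = "my-project"  # placeholder project ID

    # Group all visible zones by the region they belong to.
    zones_by_region = defaultdict(list)
    for zone in compute_v1.ZonesClient().list(project=PROJECT):
        # zone.region is a URL ending in the region name,
        # e.g. ".../regions/us-west1".
        region = zone.region.rsplit("/", 1)[-1]
        zones_by_region[region].append(zone.name)

    for region, zones in sorted(zones_by_region.items()):
        print(region, sorted(zones))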
OVH SBG1, SBG2, and SBG3 are all next to each other, on a former industrial site on the banks of the Rhine river. When one of them burned to the ground, it is not surprising that fire containment required everything to be shut down. Physical safety takes priority.
Data centers this close together are not necessarily a problem. They can still be several isolated failure domains, with separate power, network, and so on. A building catching fire is a hopefully rare event, possibly contained to a single room by fire suppression.
As a part of business continuity planning, think about regional-scale scenarios. In the category of natural disasters, say a hurricane hits the area: all AZs in a region could be impacted by damage and power outages, and separating the buildings by a kilometer may not help in such a large-scale event.
Backups shipped off-site to a distant city are likely to escape even the worst disasters. At least your data survives.
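As a concrete sketch of that last point, copying a backup object to a bucket in a geographically distant region with Python and boto3 could look like this (the bucket names, regions, and key are hypothetical):

    import boto3

    # Hypothetical names; the point is only that the replica lives in a
    # region far away from the primary site.
    SRC_BUCKET = "example-backups-paris"       # bucket in eu-west-3
    DST_BUCKET = "example-backups-stockholm"   # bucket in eu-north-1
    KEY = "db/backup-latest.dump"

    s3 = boto3.client("s3", region_name="eu-north-1")

    # Server-side copy of the latest backup into the distant bucket.
    s3.copy_object(
        Bucket=DST_BUCKET,
        Key=KEY,
        CopySource={"Bucket": SRC_BUCKET, "Key": KEY},
    )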
I'll answer my own question as well, to share what I've found after reading up a little on how the various cloud providers design their regions.
In general, most cloud providers are unspecific when they describe how they physically lay out the availability zones in a region. This is also confirmed by a recent critique from AWS, which points out that its rivals Google and Azure are reluctant to give any clear description of how their regions work in terms of availability zones.
You will generally not find documentation on where each data center or availability zone within a region is located; this goes for most cloud providers. AWS does not share details on where its availability zones are located either, but it does state that its availability zones are always located at least several kilometers apart.
Azure:
To ensure resiliency, there's a minimum of three separate zones in all enabled regions. The physical separation of Availability Zones within a region protects applications and data from datacenter failures.
https://docs.microsoft.com/en-us/azure/availability-zones/az-overview
Google:
A zone is a deployment area for Google Cloud resources within a region. Zones should be considered a single failure domain within a region. To deploy fault-tolerant applications with high availability and help protect against unexpected failures, deploy your applications across multiple zones in a region.
https://cloud.google.com/docs/geography-and-regions
AWS:
AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.
https://aws.amazon.com/about-aws/global-infrastructure/regions_az/