Please Survive an Availability Zone Outage

If you’re a cloud watcher like me, you probably heard about a major Alibaba Cloud outage in their Hong Kong region a few days ago that impacted some customers for up to 24 hours. They posted a detailed incident report.

It’s always interesting and informative to read well-written incident reports because you can learn things that help you as a customer (or you as a cloud infrastructure provider) to improve and avoid incidents in the future. So I opened the report expecting to read about a series of unfortunate events involving complex distributed computing hardware and software.

Instead, the Alibaba Cloud outage was fundamentally an HVAC issue: a “high temperature” event in which the fire suppression mechanisms started to kick in. That is curious, because HVAC problems in cloud data centers are usually isolated by design to a single availability zone (AZ). And in fact this one was. But if you are one of the impacted customers, make no mistake:

Your cloud system should be able to tolerate a failure in a single availability zone.

In fact, it’s right there in the name: it’s called an availability zone for a reason. The major cloud providers that build this architecture divide their infrastructure into geographically dispersed regions, and within each region are physically separate availability zones connected by ultra-fast network links.

AZs are supposed to give you insurance against exactly this type of real-world event. Of course, to take advantage of that insurance, you have to make sure your application is correctly architected.

But the thing is, in most cases, unlike going multi-region, running multi-AZ really isn’t that difficult, and most of the commonly used cloud services and patterns steer you in that direction automatically.

For example, if you’re running a non-trivial service, you probably have your web servers deployed in a cluster. When you set that up, it’s usually easier than not to have your cluster be resilient to an AZ failure. And almost all serverless services don’t have single-AZ points of failure.
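As a rough illustration of what "resilient to an AZ failure" means for a cluster, here is a minimal Python sketch. The zone names, instance counts, and the `survives_single_az_loss` helper are all hypothetical, not any provider's API; the function simply checks whether enough instances remain if any one zone disappears.

```python
from collections import Counter

def survives_single_az_loss(instance_zones, min_healthy):
    """Return True if at least `min_healthy` instances remain after losing any one AZ.

    instance_zones: list of zone names, one entry per running instance
    (hypothetical data, not pulled from a real cloud API).
    """
    per_zone = Counter(instance_zones)
    total = len(instance_zones)
    # The worst case is losing the zone that hosts the most instances.
    worst_loss = max(per_zone.values(), default=0)
    return total - worst_loss >= min_healthy

# Six web servers spread evenly across three zones; four are needed to carry the load.
spread = ["zone-a", "zone-b", "zone-c"] * 2
# The same six servers packed into a single zone.
packed = ["zone-a"] * 6

print(survives_single_az_loss(spread, min_healthy=4))  # True
print(survives_single_az_loss(packed, min_healthy=4))  # False
```

The point of the check is that spreading alone is not enough: the instances left after the worst single-zone loss also have to carry the load.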

For the handful of services that aren’t naturally easy to run multi-AZ, there are well-known and well-documented procedures for avoiding, detecting, and recovering from a single-AZ failure while meeting your RPO and RTO targets.
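To make the RPO and RTO idea concrete, here is a small Python sketch that scores one recovery drill against its targets. It assumes you can record three timestamps yourself (last successful replication, the moment of failure, and the moment service was restored); the `meets_objectives` helper and all the sample values are hypothetical.

```python
from datetime import datetime, timedelta

def meets_objectives(last_replicated, failed_at, restored_at, rpo, rto):
    """Return (rpo_met, rto_met) for a single failure or drill.

    RPO bounds how much recent data may be lost; RTO bounds how long
    the service may stay down.
    """
    data_loss_window = failed_at - last_replicated
    downtime = restored_at - failed_at
    return data_loss_window <= rpo, downtime <= rto

# Hypothetical drill: replication was 4 minutes behind, recovery took 50 minutes.
result = meets_objectives(
    last_replicated=datetime(2022, 12, 18, 12, 0),
    failed_at=datetime(2022, 12, 18, 12, 4),
    restored_at=datetime(2022, 12, 18, 12, 54),
    rpo=timedelta(minutes=5),
    rto=timedelta(minutes=30),
)
print(result)  # (True, False): data loss is within target, downtime is not
```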

If you are a cloud architect, you need only to:

Architect to make as much of your system as possible handle an AZ failure transparently.
Understand which of the services you use have a single-AZ dependency.
Document (if necessary) what procedures need to be taken in an AZ failure.
Test that those procedures work. (Your cloud provider can probably help you simulate an AZ failure.)
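
The second step in the list above can start as a simple inventory check. The sketch below is one way to do it in Python, assuming you already have a mapping from each service to the zones it runs in; the service names, zone names, and the `single_az_services` helper are hypothetical.

```python
def single_az_services(inventory):
    """Return the sorted names of services that run in fewer than two distinct AZs."""
    return sorted(
        name for name, zones in inventory.items() if len(set(zones)) < 2
    )

# Hypothetical inventory: service name -> zone of each deployed instance.
inventory = {
    "web": ["zone-a", "zone-b", "zone-c"],
    "primary-db": ["zone-a"],
    "cache": ["zone-b", "zone-b"],  # two nodes, but both in the same zone
}

print(single_az_services(inventory))  # ['cache', 'primary-db']
```

Anything this flags is a candidate for either a multi-AZ redesign or a documented, tested recovery procedure.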

Now might be a good time to go and check these things out in your cloud infrastructure. Let me know what kinds of issues you find and how challenging they are to resolve.
