If Hand Sanitizer Has Four 9’s of Reliability, Does Your System Need Five?
Recently in a moment of extreme boredom I found myself pondering one of the many extra bottles of hand sanitizer I have left over from the pandemic, and a phrase on the bottle caught my eye:
Kills 99.99% of Germs
My distributed systems engineering mind immediately interpreted that as “four 9’s of reliability” against germs, which raises the obvious (it’s obvious, right?) follow-up question: why are software engineers so often asked for five 9’s of uptime? Does this make any sense?
Let’s get this out of the way first: If you work in safety-critical systems — medical, aerospace, and the like — ignore what I’m about to say because it does not apply to you. Go for all those 9’s, as many as required and as many as possible. (But also reach out if you’ve read this far because I’d love to hear about the validation and verification techniques you’re using.)
For the rest of you, you probably work in things like traditional IT, e-commerce, social, developer tools and infrastructure, or marketplaces. How many 9’s do you need? Two? Four?
Another thing I’d like to clarify is that I’m certainly not going to suggest that you do no dedicated reliability work whatsoever. In fact, there are a lot of well-understood architectural and operational approaches that should allow most engineering teams and software systems to proceed fairly smoothly as far as, say, three 9’s. You aren’t doing your job correctly as a professional developer if you don’t at least follow these best practices when building and deploying your applications.
I’d like to focus on what it means, especially to the business, to go for even higher levels of reliability: the four and five 9’s.
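To make the 9’s concrete, here is a rough sketch of how much downtime each level allows per year. It assumes availability is measured over a 365-day year and that every minute counts equally; the function name is just something I made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; ignores leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for N nines of availability (3 -> 99.9%)."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = downtime_minutes_per_year(n)
        print(f"{n} nines: about {minutes:,.1f} minutes of downtime per year")
```

Under those assumptions, three 9’s leaves you a budget of roughly 8.8 hours a year, four 9’s about 53 minutes, and five 9’s a little over 5 minutes.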
Reliability Is a Trade-Off
In most domains, reliability is just another engineering trade-off and should be treated as such. We usually look at engineering trade-offs in terms of some cost/benefit analysis.
The benefits of higher reliability are often quantifiable via metrics like lost sales in e-commerce, which you can predict per minute of downtime with a little bit of history and scale. Who doesn’t like preventing lost sales, right?
But the costs of achieving that high reliability can be substantial. Let’s take a look at three sources of reliability-related costs.
Infrastructure
The first cost of “adding a 9” is infrastructure. This might take the form of redundant network, compute, and storage. It probably also involves extra data center or cloud region presence. With more and more companies starting to scrutinize their infrastructure bills, this one might directly hurt the bottom line.
Software Development
The next cost is software development. If you’re primarily running a static website, you’re in the clear. But if you’re running anything more complicated than that, you’re probably going to have to write some extra code to get higher levels of reliability. And potentially a lot of it.
Most AWS services only have a 99.9% or at best 99.99% SLA. It’s not impossible, but it’s also no small trick, to get four 9’s of reliability on top of three-9’s infrastructure. (I’ll leave some of the techniques for doing that as an exercise for the reader, and also note that this can be a good systems design interview question.) If you have more than a couple of different moving parts in your architecture and it’s sitting at least in part on top of 99.9% infrastructure, reliability math starts to work against you very quickly, because the availabilities of the components a request depends on multiply together. And just when you thought it couldn’t get any worse, you might find that the work you do to “add reliability” perversely turns into an additional source of failure modes in your system.
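Here is a back-of-the-envelope sketch of that reliability math. It leans on two big simplifying assumptions: components fail independently of each other, and failing over between redundant copies is itself free and flawless. Neither holds cleanly in real systems, so treat the numbers as best-case estimates; the function names are hypothetical.

```python
from math import prod


def serial_availability(components: list[float]) -> float:
    """Availability of a request path that needs every component to be up.

    Assumes independent failures, so the availabilities simply multiply.
    """
    return prod(components)


def redundant_availability(replica: float, copies: int) -> float:
    """Availability when any one of `copies` identical replicas is enough.

    Assumes independent failures and a perfect failover mechanism.
    """
    return 1 - (1 - replica) ** copies


if __name__ == "__main__":
    # Five dependencies in a row, each at three 9's.
    path = serial_availability([0.999] * 5)
    print(f"five 99.9% dependencies in series: {path:.5f}")

    # Two independent copies of a single three-9's component.
    pair = redundant_availability(0.999, 2)
    print(f"two redundant 99.9% replicas: {pair:.6f}")
```

Five three-9’s dependencies in series come out at roughly 99.5%, which is well short of three 9’s, let alone four. Redundancy claws some of that back on paper, but only to the extent that the independence and perfect-failover assumptions actually hold, which is exactly where the extra code and the new failure modes come from.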
Opportunity Cost
Finally, adding levels of reliability comes with opportunity costs, especially for startups and smaller companies. Sure, you can spend an engineer-month saving yourself that extra 5 minutes of downtime so that you don’t lose 5 minutes of sales. But what if you could have spent that engineer-month adding new growth features that increase checkouts 10% on a 24/7/365 basis? I know what most businesses would choose…
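To put rough numbers on that comparison, here is a sketch with made-up inputs. The revenue-per-minute figure is purely hypothetical, and I’m assuming revenue is flat across the year and scales linearly with checkouts. Real outages tend to cost more at peak times, so adjust to taste.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def revenue_saved_by_uptime(revenue_per_minute: float, minutes_avoided: float) -> float:
    """Sales recovered per year by avoiding `minutes_avoided` of downtime."""
    return revenue_per_minute * minutes_avoided


def revenue_gained_by_feature(revenue_per_minute: float, uplift: float) -> float:
    """Extra sales per year from a feature that lifts checkouts by `uplift`."""
    return revenue_per_minute * MINUTES_PER_YEAR * uplift


if __name__ == "__main__":
    revenue_per_minute = 100.0  # hypothetical figure, not from any real business

    saved = revenue_saved_by_uptime(revenue_per_minute, minutes_avoided=5)
    gained = revenue_gained_by_feature(revenue_per_minute, uplift=0.10)

    print(f"avoiding 5 minutes of downtime: {saved:,.0f} per year")
    print(f"10% more checkouts all year:    {gained:,.0f} per year")
    print(f"ratio: about {gained / saved:,.0f}x")
```

Notice that the ratio between the two doesn’t depend on the revenue figure at all: under these assumptions it is just 10% of the minutes in a year divided by 5 minutes, or roughly 10,000 to 1.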
About That Hand Sanitizer
To close, I’d like to return to my bottle of hand sanitizer. It touts its four 9’s of reliability as a key marketing feature, a point of pride. So don’t assume you have to build in five 9’s of reliability to have a successful software system.