Erlang's 99.9999999% (nine nines) reliability

Solution 1:

The reliability figure wasn't supposed to measure the total time any part of the AXD301 (the system in question) was ever shut down over 20 years. It represents the total time over those 20 years that the service provided by the AXD301 was offline. Subtle difference. As Joe Armstrong says here:

The AXD301 has achieved a NINE nines reliability (yes, you read that right, 99.9999999%). Let’s put this in context: 5 nines is reckoned to be good (5.2 minutes of downtime/year). 7 nines almost unachievable ... but we did 9.

Why is this? No shared state, plus a sophisticated error recovery model.
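For context, these availability figures translate into concrete downtime budgets. A quick back-of-the-envelope sketch (not from the original article; the year length used is my own assumption):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, ~31,557,600 s

def downtime_per_year(nines: int) -> float:
    """Seconds of allowed downtime per year at a given number of nines."""
    availability = 1 - 10 ** -nines
    return (1 - availability) * SECONDS_PER_YEAR

print(f"5 nines: {downtime_per_year(5) / 60:.1f} minutes/year")  # ~5.3 minutes
print(f"9 nines: {downtime_per_year(9) * 1000:.1f} ms/year")     # ~31.6 ms
```

The five-nines result matches the "5.2 minutes of downtime/year" in the quote (the small difference comes from the choice of year length); nine nines allows barely 30 milliseconds of downtime per year.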

If you dig a bit deeper into the PhD thesis written by Joe Armstrong, the original author of Erlang (which includes a case study of the AXD301), you read:

One of the projects studied in this chapter is the Ericsson AXD301, a high-performance highly-reliable ATM switch.

So, as long as the network that the switch was part of kept running without downtime, the author could state "nine nines reliability" for the AXD301 (which was all he ever said, avoiding specifics). It doesn't necessarily mean Erlang is the sole cause of such high reliability.

EDIT: In fact, "20 years" itself seems like a misinterpretation. Joe mentions a figure of 20 years in the same article, but it's not actually connected to the nine-nines reliability figure, which potentially came out of a much shorter study (as others have mentioned).

Solution 2:

While the others have addressed the specific case you're asking about, your question seems to be based on a misapprehension. The way you've asked it suggests you think there is a manual process for getting the system running again after it crashes or is taken down for maintenance.

Erlang has several features that remove human working time as a source of downtime:

  1. Hot code reloading. In an Erlang system, it is easy to compile and load a replacement module for an existing one. The BEAM emulator does the swap automatically without apparently stopping anything. There is doubtless some tiny amount of time during which this transfer happens, but it's happening automatically in computer time, rather than manually in human time. This makes it possible to do upgrades with essentially zero downtime. (You could have downtime if the replacement module has a bug which crashes the system, but that's why you test before deploying to production.)

  2. Supervisors. Erlang's OTP library has a supervisory framework built into it which lets you define how the system should react if a module crashes. The standard action here is to restart the failed module. Assuming the restarted module doesn't immediately crash again, the total downtime charged against your system might be a matter of milliseconds. A solid system that hardly ever crashes might indeed accumulate only a fraction of a second of total downtime over the course of years of run time.

  3. Processes. These correspond roughly to threads in other languages, except that they do not share state; apart from external persistent data stores, all communication happens via message passing. Because Erlang processes are very inexpensive (far cheaper than OS threads), this encourages a loosely-coupled design, so that if a process dies, only one tiny part of the system experiences downtime. Typically, the supervisor restarts that one process, with little to no impact on the rest of the system.

  4. Asynchronous message passing. When one process wants to tell another something, there is a first-class operator in the Erlang language that lets it do that. The message sending process doesn't have to wait for the receiver to process the message, and it doesn't have to coordinate ownership of data sent. The asynchronous functional nature of Erlang's message-passing system takes care of all that. This helps maintain high uptimes because it reduces the effect that downtime in one part of a system can have on other parts.

  5. Clustering. This follows from the previous point: Erlang's message passing mechanism works transparently between machines on a network, so a sending process doesn't even have to care that the receiver is on a separate machine. This provides an easy mechanism for dividing a workload up among many machines, each of which can go down separately without harming overall system uptime.
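None of this is real OTP code, but the restart logic behind points 2 and 3 can be sketched in plain Python (all names here are illustrative, not part of any Erlang or Python API):

```python
class Supervisor:
    """Toy one-for-one supervisor: restart the child whenever it crashes."""

    def __init__(self, child, max_restarts=3):
        self.child = child            # a zero-argument callable, like a worker's entry point
        self.max_restarts = max_restarts
        self.restarts = 0

    def run(self):
        while True:
            try:
                return self.child()   # child runs until it returns normally or crashes
            except Exception:
                self.restarts += 1
                if self.restarts > self.max_restarts:
                    raise             # give up and escalate, as an OTP supervisor would

# A worker that crashes twice before settling down:
attempts = {"n": 0}

def flaky_worker():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated crash")
    return "ok"

sup = Supervisor(flaky_worker)
result = sup.run()
print(result, "after", sup.restarts, "restarts")  # ok after 2 restarts
```

In real OTP, the `supervisor` behaviour's restart strategies and restart-intensity limits play the role of `max_restarts` here, and a restart completes in microseconds to milliseconds, which is why a single crash barely registers as downtime.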

Solution 3:

The 99.9999999% availability figure is an often-quoted but fundamentally misleading statistic. Mats Cronqvist, one of the AXD-301 team members, gave a presentation (video) (which I attended) at the 2010 Erlang Factory conference in San Francisco, discussing this precise availability statistic. According to him, it was claimed by British Telecom for a trial period (I believe from January to September 2002) of "5 node-years" using the AXD-301. There were 14 nodes carrying live traffic by the end of the trial.
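To see what that claim amounts to: over 5 node-years, nine nines leaves a total downtime budget of well under a second, which is why a short, closely watched trial could plausibly hit it (quick arithmetic; the year length is my assumption):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # ~31,557,600 s
node_years = 5
unavailability = 1e-9                  # nine nines

budget = node_years * SECONDS_PER_YEAR * unavailability
print(f"{budget:.3f} seconds of total downtime allowed over the trial")  # ~0.158 s
```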

Cronqvist specifically stated that this is not representative of the entire AXD-301 history, or Erlang in general, and that he was not happy that Joe Armstrong kept quoting this, leading to overblown expectations of Erlang's reliability. Others have written that five nines is a more realistic figure.

It should be stated that I am a fervent Erlang supporter and developer, who believes that the expert use of Erlang can indeed lead to very highly available systems, but just wants to reduce the hype. I of course assume that Cronqvist's representation of the facts is accurate, and have no reason to believe otherwise.

Solution 4:

My understanding of those statistics is that they are computed over ALL AXD301 systems in production. We can expect that when one AXD301 has a severe problem, it will be down for more than 0.631 seconds. During this period, other AXD301 systems take over to keep the network operational.

However, when you sum the total number of hours across all running AXD301 systems and compute the ratio of downtime attributable to the one failing AXD301, you find 99.9999999%.
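The 0.631-second figure mentioned above is simply what nine nines implies over 20 years of aggregate run time (a quick check; the year length is my assumption):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # ~31,557,600 s

downtime = 20 * SECONDS_PER_YEAR * 1e-9  # 20 years at nine nines
print(f"{downtime:.3f} s of downtime in 20 years")  # 0.631 s
```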

That's how I understand this figure.

Hope this helps.