Erlang的99.9999999%(九九)可靠性 [英] Erlang's 99.9999999% (nine nines) reliability

查看:380
本文介绍了Erlang的99.9999999%(九九)可靠性的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

据报道, Erlang 已在生产系统中使用了20多年,其正常运行时间为99.9999999%。



我做了以下数学:

  20 * 365.25 * 24 * 60 * 60 *(1  -  0.999999999)== 0.631 s 

这意味着系统只有不到一个二十年来第二次停机。我不是在质疑这一点的有效性,我只是好奇,我们可以关闭系统(故意或意外)只有0.631秒。熟悉大型软件系统的人都可以向我们解释这一点吗?有没有人知道如何通过一组处理单元(或机器)计算服务的停机时间。 ?

解决方案

可靠性数据不应该是衡量总体时间的任何部分 AXD301 (有关项目)已经关闭20多年了。它代表了这些20年以来由 AXD301 系统提供的服务已经脱机的总时间。细微差别正如Joe Armstrong所说: here


AXD301已经达到了九项九项可靠性(是的,你读得正确,99.9999999%)。让我们把它放在上下文中:5个人被认为是好的(5.2分钟的停机时间/年)。 7个人几乎不可实现...但我们做了9。



为什么是这样?没有共享状态,还有一个复杂的错误恢复模式。


如果你深入挖掘,在Joe写的博士论文中Erlang的原始作者(其中包括案例研究 AXD301 ),您阅读:


本章中研究的项目之一是爱立信AXD301,
高性能高可靠性ATM交换机


所以,只要交换机的一部分网络正在运行而不停机,作者可以为AXD301/$ c>(这是他曾经说过的,避免细节)。这不一定意味着Erlang是这样高可靠性的唯一原因。



编辑:其实20年本身似乎是一个误解。 Joe在同一篇文章中提到了20年的数字,但它并没有与九个可靠性数据实际相关,这可能性远远低于其他人提到的研究(


Erlang was reported to have been used in production systems for over 20 years with an uptime percentage of 99.9999999%.

I did the math as the following:

20*365.25*24*60*60*(1 - 0.999999999) == 0.631 s

That means the system only has less than one second of downtime during the period of 20 years. I am not trying to challenge the validity of this, I am just curious about how we can shut down a system (on purpose or by accident) for only 0.631 second. Could anyone who are familiar with large software system explain this to us? Thank you.


Does anyone know how to calculate the downtime of a service over a cluster of processing units (or machines)?

解决方案

The reliability figure wasn't supposed to measure the total time any part of AXD301 (project in question) was ever shut down for over 20 years. It represents the total time over those 20 years that the service provided by the AXD301 system was ever offline. Subtle difference. As Joe Armstrong says here:

The AXD301 has achieved a NINE nines reliability (yes, you read that right, 99.9999999%). Let’s put this in context: 5 nines is reckoned to be good (5.2 minutes of downtime/year). 7 nines almost unachievable ... but we did 9.

Why is this? No shared state, plus a sophisticated error recovery model.

If you dig a bit deeper, in the PhD thesis written by Joe, the original author of Erlang (which includes a case study of AXD301), you read:

One of the projects studied in this chapter is the Ericsson AXD301, a high-performance highly-reliable ATM switch.

So, as long as the network that the switch was a part of was running without downtime, the author can state "nine nines reliability" for AXD301 (which was all he ever said, avoiding specifics). It doesn't necessarily mean Erlang is the only cause of such high reliability.

EDIT: In fact, "20 years" itself seems like a misinterpretation. Joe mentions a figure of 20 years in the same article, but it's not actually connected to the nine-nines reliability figure, which potentially came out of a much shorter study (as others have mentioned).

这篇关于Erlang的99.9999999%(九九)可靠性的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆