Are Erlang/OTP messages reliable? Can messages be duplicated?

Question

Long version

I'm new to erlang, and considering using it for a scalable architecture. I've found many proponents of the platform touting its reliability and fault tolerance.

However, I'm struggling to understand exactly how fault-tolerance is achieved in this system where messages are queued in transient memory. I understand that a supervisor hierarchy can be arranged to respawn deceased processes, but I've been unable to find much discussion of the implications of respawning on works-in-progress. What happens to in-flight messages and the artifacts of partially-completed work that were lost on a dying node?

Will all producers automatically retransmit messages that are not ack'd when consumer processes die? If not, how can this be considered fault-tolerant? And if so, what prevents a message that was processed -- but not quite acknowledged -- from being retransmitted, and hence reprocessed inappropriately?

(I recognize that these concerns are not unique to erlang; similar concerns will arise in any distributed processing system. But erlang enthusiasts seem to claim that the platform makes this all "easy"..?)

Assuming messages are retransmitted, I can easily envision a scenario where the downstream effects of a complex messaging chain could become very muddled after a fault. Without some sort of heavy distributed transaction system, I don't understand how consistency and correctness can be maintained without addressing duplication in every process. Must my application code always enforce constraints to prevent transactions from being executed more than once?

Short version:

Are distributed erlang processes subject to duplicated messages? If so, is duplicate-protection (ie, idempotency) an application responsibility, or does erlang/OTP somehow help us with this?

Recommended answer

I'll separate this into points I hope will make sense. I might be re-hashing a bit of what I have written in The Hitchhiker's Guide to Concurrency. You might want to read that one to get details on the rationale behind the way message passing is done in Erlang.

1. Message transmission

Message passing in Erlang is done through asynchronous messages sent into mailboxes (a kind of queue for storing data). There is absolutely no assumption as to whether a message was received or not, or even that it was sent to a valid process. This is because it is plausible to assume [at a language level] that someone might want to treat a message in maybe only 4 days and won't even acknowledge its existence until it has reached a certain state.

A random example of this could be to imagine a long-running process that crunches data for 4 hours. Should it really acknowledge it received a message if it's unable to treat it? Maybe it should, maybe not. It really depends on your application. As such, no assumption is made. You can have half your messages asynchronous and only one that isn't.

Erlang expects you to send an acknowledgement message (and wait on it with a timeout) if you ever need it. The rules having to do with timing out and the format of the reply are left to the programmer to specify -- Erlang can't assume you want the acknowledgement on message reception, when a task is completed, whether it matches or not (the message could match in 4 hours when a new version of the code is hot-loaded), etc.
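
As a minimal sketch of what "send an acknowledgement and wait on it with a timeout" can look like, the module below uses a reference to tag each request. The module name `ack_demo`, the message shapes `{request, From, Ref, Payload}` and `{ack, Ref}`, and the choice to acknowledge after the work is done are all assumptions made for illustration; nothing in Erlang mandates them.

```erlang
-module(ack_demo).
-export([start/0, send_with_ack/3]).

%% Start a consumer process (hypothetical example consumer).
start() ->
    spawn(fun loop/0).

loop() ->
    receive
        {request, From, Ref, Payload} ->
            %% Here the acknowledgement is sent only once the work is done.
            %% Acknowledging on receipt instead is an equally valid choice.
            _ = handle(Payload),
            From ! {ack, Ref},
            loop()
    end.

%% Placeholder for the real work.
handle(Payload) ->
    Payload.

%% Send Payload to Pid and wait up to Timeout milliseconds for the ack.
%% Sending to a dead process does not fail; only the timeout reveals it.
send_with_ack(Pid, Payload, Timeout) ->
    Ref = make_ref(),
    Pid ! {request, self(), Ref, Payload},
    receive
        {ack, Ref} -> ok
    after Timeout ->
        {error, timeout}
    end.
```

Note that `{error, timeout}` only says no acknowledgement arrived in time. The consumer may still have processed the request, and a late `{ack, Ref}` can land in the caller's mailbox afterwards, which is exactly the ambiguity the question is about.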

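To make it short, whether a message goes unread, fails to be received, or is interrupted by someone pulling the plug while it is in transit doesn't matter if you don't want it to. If you want it to matter, you need to design that logic across processes.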

The burden of implementing a high-level message protocol between Erlang processes is given to the programmer.

2. Message protocols

As you said, these messages are stored in transient memory: if a process dies, all the messages it hadn't read yet are lost. If you want more, there are various strategies. A few of them are:


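  • Read messages as fast as possible and write them to disk if needed, send an acknowledgement back, and process them later. Compare this to queue software such as RabbitMQ and ActiveMQ with persistent queues.
  • Use process groups to duplicate messages across a group of processes on multiple nodes. At this point you might enter transactional semantics. This one is used by the mnesia database for transaction commits.
  • Do not assume anything has worked until you receive either an acknowledgement that everything went fine or a failure message.
  • A combination of process groups and failure messages. If a first process fails to handle a task (because the node goes down), a notification is automatically sent by the VM to a fail-over process, which handles it instead. This method is sometimes used with full applications to handle hardware failures.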
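
The "wait for an acknowledgement or a failure message" strategy can be sketched with a monitor, so the caller hears about the consumer's death instead of only timing out. This is similar in spirit to what `gen_server:call` does, but the module name `call_demo` and the `{request, ...}` / `{reply, ...}` message shapes are assumptions for this example.

```erlang
-module(call_demo).
-export([call/3]).

%% Send Request to Pid and wait for a reply, a failure notice, or a timeout.
call(Pid, Request, Timeout) ->
    %% The monitor reference doubles as a unique tag for this request.
    MRef = erlang:monitor(process, Pid),
    Pid ! {request, self(), MRef, Request},
    receive
        {reply, MRef, Result} ->
            %% Drop the monitor and flush any 'DOWN' already in the mailbox.
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% The VM tells us the process died (or never existed: noproc).
            {error, {process_down, Reason}}
    after Timeout ->
        erlang:demonitor(MRef, [flush]),
        {error, timeout}
    end.
```

A `process_down` result still does not say whether the request was handled before the process died; deciding whether to retry is left to the caller.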

Depending on the task at hand, you might use one or many of these. They're all possible to implement in Erlang and in many cases modules are already written to do the heavy lifting for you.

So this might answer your question. Because you implement the protocols yourself, it's your choice whether messages get sent more than once or not.
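
If your protocol does resend messages, one common way to make reprocessing harmless is to tag each message with an ID and have the consumer ignore IDs it has already seen. The sketch below assumes the sender assigns the IDs; the module name `dedup_demo` and the message shapes are hypothetical, and the set of seen IDs lives only in process memory.

```erlang
-module(dedup_demo).
-export([start/0, deliver/3, count/1]).

%% Start a consumer that processes each message ID at most once.
start() ->
    spawn(fun() -> loop(sets:new(), 0) end).

loop(Seen, Count) ->
    receive
        {msg, From, Id, _Payload} ->
            case sets:is_element(Id, Seen) of
                true ->
                    %% Duplicate: acknowledge again, but do not redo the work.
                    From ! {ack, Id},
                    loop(Seen, Count);
                false ->
                    %% First delivery: do the work (here, just count it).
                    From ! {ack, Id},
                    loop(sets:add_element(Id, Seen), Count + 1)
            end;
        {count, From} ->
            From ! {count, Count},
            loop(Seen, Count)
    end.

%% Deliver a message; calling this again with the same Id acts as a resend.
deliver(Pid, Id, Payload) ->
    Pid ! {msg, self(), Id, Payload},
    receive
        {ack, Id} -> ok
    after 1000 ->
        {error, timeout}
    end.

%% Ask how many distinct messages were actually processed.
count(Pid) ->
    Pid ! {count, self()},
    receive
        {count, N} -> N
    after 1000 ->
        {error, timeout}
    end.
```

Because `Seen` is lost when the consumer dies, surviving a restart would require keeping it somewhere durable, for example on disk as in the first strategy above.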

3. What is fault-tolerance

Picking one of the above strategies does depend on what fault-tolerance means to you. In some cases, people mean it to say "no data is ever lost, no task ever fails." Other people use fault-tolerance to say "the user never sees a crash." In the case of Erlang systems, the usual meaning is about keeping the system running: it's alright to maybe have a single user dropping a phone call rather than having everyone dropping it.

Here the idea is then to let stuff that fails fail, but keep the rest running. To achieve this, there are a few things the VM gives you:


  • You can know when a process dies and why it did
  • You can force processes that depend on each other to die together if one of them goes wrong
  • You can run a logger that automatically logs every uncaught exception for you, and even define your own
  • Nodes can be monitored so you know when they went down (or got disconnected)
  • You can restart failed processes (or groups of failed processes)
  • Have whole applications restarting on different nodes if one fails
  • And a lot more stuff with the OTP framework
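
To illustrate the "restart failed processes" item, here is a minimal sketch of an OTP supervisor that restarts a single worker. The module name `restart_demo`, the child ID `worker`, the trivial worker loop, and the restart limits (3 restarts in 5 seconds) are all assumptions for the example; a real worker would normally be a `gen_server` or similar OTP behaviour.

```erlang
-module(restart_demo).
-behaviour(supervisor).
-export([start_link/0, init/1, start_worker/0, worker_pid/0]).

%% Start the supervisor, registered locally under the module name.
start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Supervisor callback: one permanent child, restarted individually.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 3, period => 5},
    Child = #{id => worker,
              start => {?MODULE, start_worker, []},
              restart => permanent},
    {ok, {SupFlags, [Child]}}.

%% Called inside the supervisor process, so spawn_link links the worker to it.
start_worker() ->
    {ok, spawn_link(fun worker_loop/0)}.

%% A deliberately trivial worker that crashes on request.
worker_loop() ->
    receive
        crash -> exit(boom);
        _Other -> worker_loop()
    end.

%% Look up the current worker pid (changes after every restart).
worker_pid() ->
    [{worker, Pid, _Type, _Mods}] = supervisor:which_children(?MODULE),
    Pid.
```

The restarted worker is a brand-new process with an empty mailbox and fresh state. Anything the old process had queued or half-finished is gone unless one of the protocols from section 2 was layered on top.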

With these tools and a few of the standard library's modules handling different scenarios for you, you can implement pretty much what you want on top of Erlang's asynchronous semantics, although it usually pays to be able to use Erlang's definition of fault tolerance.

4. A few notes

My personal opinion here is that it's pretty hard to have more assumptions than what exists in Erlang unless you want pure transactional semantics. One problem you'll always have trouble with is with nodes going down. You can never know if they went down because the server actually crashed or because the network failed.

In the case of a server crash, simply re-doing the tasks is easy enough. However with a net split, you have to make sure some vital operations are not done twice, but not lost either.

It usually boils down to the CAP theorem which basically gives you 3 options, of which you have to pick two:


  1. Consistency
  2. Partition tolerance
  3. Availability

Depending on where you position yourself, different approaches will be needed. The CAP theorem is usually used to describe databases, but I believe similar questions are to be asked whenever you need some level of fault tolerance when processing data.
