What is causing Azure Event Hubs ReceiverDisconnectedException/LeaseLostException?

Question

I'm receiving events from an EventHub using EventProcessorHost and an IEventProcessor class (call it MyEventProcessor). I scale this out to two servers by running my EPH on both servers and having them connect to the Hub using the same ConsumerGroup but unique hostName values (the machine name).

The problem is: at random hours of the day/night, the app logs this:

Exception information: 
Exception type: ReceiverDisconnectedException 
Exception message: New receiver with higher epoch of '186' is created hence current receiver with epoch '186' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used.
  at Microsoft.ServiceBus.Common.ExceptionDispatcher.Throw(Exception exception)
  at Microsoft.ServiceBus.Common.Parallel.TaskHelpers.EndAsyncResult(IAsyncResult asyncResult)
  at Microsoft.ServiceBus.Messaging.IteratorAsyncResult`1.StepCallback(IAsyncResult result)

This Exception occurs at the same time as a LeaseLostException, thrown from MyEventProcessor's CloseAsync method when it tries to checkpoint. (Presumably Close is being called because of the ReceiverDisconnectedException?)

I think this is occurring due to Event Hubs' automatic lease management when scaling out to multiple machines. But I'm wondering whether I need to do something differently to make it work more cleanly and avoid these exceptions, e.g. something with epochs?

Recommended answer

TLDR: This behavior is absolutely normal.

Why can't lease management be smooth and exception-free? To give the developer more control over the situation.

The really long story, all the way from the basics: EventProcessorHost (hereafter EPH) is written by the Microsoft Azure EventHubs team themselves to translate all of the EventHubs partition-receiver plumbing into a simple onReceive(Events) callback. It is very similar to what the __consumer_offset topic does for Kafka consumers: it serves as the partition ownership and checkpoint store.

EPH addresses two general, well-known problems that arise when reading from a high-throughput partitioned stream such as Event Hubs:

  1. A fault-tolerant receive pipeline. A simpler version of the problem: if the host running a PartitionReceiver dies and comes back, it needs to resume processing from where it left off. To remember the last successfully processed EventData, EPH stores checkpoints in the blob supplied to the EPH constructor whenever the user invokes context.CheckpointAsync(). If the host process dies (for example, it reboots abruptly, or hits a hardware fault and never comes back), any EPH instance can pick up this task and resume from that checkpoint.

  2. Balancing/distributing partitions across EPH instances. Say there are 10 partitions and 2 EPH instances processing events from them: we need a way to divide the partitions across the instances (the PartitionManager component of the EPH library does this). We use the lease management feature of Azure Storage blobs to implement this. As of version 2.2.10, to simplify the problem, EPH assumes that all partitions are loaded equally.
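
To make the checkpoint-and-resume idea from point 1 concrete, here is a minimal sketch in Python. It is not the EPH implementation: `InMemoryCheckpointStore`, `process`, and the `checkpoint_every` and `crash_after` parameters are hypothetical names for illustration, and a plain dictionary stands in for the storage blob.

```python
class InMemoryCheckpointStore:
    """Hypothetical stand-in for the storage blob that holds checkpoints."""

    def __init__(self):
        self._offsets = {}

    def checkpoint(self, partition_id, offset):
        # Remember the last successfully processed offset for this partition.
        self._offsets[partition_id] = offset

    def resume_offset(self, partition_id):
        # -1 means "no checkpoint yet, start from the beginning".
        return self._offsets.get(partition_id, -1)


def process(partition_events, partition_id, store, checkpoint_every=2, crash_after=None):
    """Process events that follow the last checkpoint.

    crash_after (an assumption for the demo) stops the loop early to mimic
    a host that dies before it can checkpoint again.
    """
    start = store.resume_offset(partition_id) + 1
    handled = []
    for offset in range(start, len(partition_events)):
        if crash_after is not None and len(handled) == crash_after:
            break
        handled.append(partition_events[offset])
        if (offset + 1) % checkpoint_every == 0:
            store.checkpoint(partition_id, offset)
    return handled


events = ["e0", "e1", "e2", "e3", "e4", "e5"]
store = InMemoryCheckpointStore()
first_run = process(events, "0", store, crash_after=3)   # host dies after 3 events
second_run = process(events, "0", store)                 # another host resumes
```

Note that the second run starts again at `e2`: that event was processed but not yet checkpointed when the first host died, so whoever picks up the partition sees it a second time.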

With this, let's try to see what's going on. To start with, take the above example of 10 Event Hub partitions and 2 EPH instances processing events from them:

  1. Say the first EPH instance, EPH1, starts alone. As part of start-up it creates receivers for all 10 partitions and begins processing events. EPH1 announces that it owns all 10 partitions by acquiring leases on 10 storage blobs that represent the 10 Event Hub partitions (EPH creates these blobs internally, with a standard naming scheme, in the storage account identified by the StorageConnectionString passed to the constructor). Each lease is acquired for a set time, after which the EPH instance loses ownership of that partition.
  2. EPH1 periodically announces that it still owns those partitions by renewing the leases on the blobs. The renewal frequency, along with other useful tuning, can be set through PartitionManagerOptions.
  3. Now say EPH2 starts up, and you supplied the same Azure Storage account to its constructor as you did for EPH1. At this point it has 0 partitions to process. To balance partitions across EPH instances, it downloads the list of all lease blobs, which holds the mapping of owner to partitionId. From this it STEALS leases for its fair share of partitions (5 in our example) and records that ownership on each of those lease blobs. As part of this, for each partition whose lease it wants to steal, EPH2 reads the latest checkpoint written for that partition and creates the corresponding PartitionReceiver with the same EPOCH as the one in the checkpoint.
  4. As a result, EPH1 loses ownership of these 5 partitions and runs into different errors depending on the exact state it is in:
    • If EPH1 is in the middle of a PartitionReceiver.Receive() call while EPH2 is creating a PartitionReceiver on the same partition, EPH1 gets a ReceiverDisconnectedException. This eventually invokes IEventProcessor.Close(CloseReason=LeaseLost). Note that the probability of hitting this specific exception is higher if the messages being received are larger or the PrefetchCount is smaller, since in both cases the receiver performs more aggressive I/O.
    • If EPH1 is checkpointing or renewing the lease when EPH2 steals it, the EventProcessorOptions.ExceptionReceived event handler is signaled with a LeaseLostException (with a 409 conflict error on the lease blob), which also eventually invokes IEventProcessor.Close(LeaseLost).
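
The fair-share stealing in step 3 can be sketched roughly as follows. This is a toy model in Python, not the actual PartitionManager algorithm: `steal_fair_share` is a hypothetical function, the lease blobs are reduced to a dictionary of partition id to owner, and the choice of which partitions to take is an arbitrary assumption.

```python
def steal_fair_share(leases, new_host):
    """Return (updated_leases, stolen_partition_ids) after new_host joins.

    leases maps partition_id -> owner host name, mimicking the owner
    information kept on the lease blobs.
    """
    hosts = set(leases.values()) | {new_host}
    fair_share = len(leases) // len(hosts)
    updated = dict(leases)
    stolen = []
    while sum(1 for owner in updated.values() if owner == new_host) < fair_share:
        # Count partitions per owner and steal from whoever holds the most.
        counts = {}
        for owner in updated.values():
            counts[owner] = counts.get(owner, 0) + 1
        richest = max((h for h in counts if h != new_host), key=lambda h: counts[h])
        victim = next(p for p in sorted(updated) if updated[p] == richest)
        updated[victim] = new_host   # the previous owner will see a lost lease
        stolen.append(victim)
    return updated, stolen


leases = {str(i): "EPH1" for i in range(10)}        # EPH1 started alone
leases, stolen = steal_fair_share(leases, "EPH2")   # EPH2 joins and rebalances
eph1_count = sum(1 for owner in leases.values() if owner == "EPH1")
```

Every partition id that ends up in `stolen` corresponds to one receiver on EPH1 that gets closed with a lease-lost reason, which is why a burst of these exceptions right after a second host starts is expected.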

Why can't lease management be smooth and exception-free:

To keep the consumer simple and error-free, EPH could have swallowed lease-management exceptions and never surfaced them to user code. However, we realized that throwing LeaseLostException could empower customers to find interesting bugs in the IEventProcessor.ProcessEvents() callback, for which the symptom would be frequent partition moves:

  • A minor network outage on a specific machine, because of which EPH1 fails to renew its leases and then comes back up. Imagine the network on this machine stays flaky for a day: the EPH instances will play ping-pong with partitions! This machine will continuously try to steal leases back from the other machine, which is legitimate from EPH's point of view but a total disaster for the user of EPH, because it completely interferes with the processing pipeline. EPH would see exactly a ReceiverDisconnectedException when the network comes back up on this flaky machine. We think the best, and in fact the only, way is to enable the developer to smell this!
  • Or a simple scenario such as a bug in the ProcessEvents logic that throws unhandled exceptions which are fatal and bring down the whole process, e.g. a poison event. This partition is going to move around a lot.
  • Customers performing write/delete operations, by mistake, on the same storage account that EPH is using (for example, an automated clean-up script).
  • Last but not least, and something we hope never happens: say a 5-minute outage, such as a network incident, in the Azure data center where a specific EventHub partition is located. Partitions are going to move around across EPH instances.
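
One way to "smell" these situations is to count how often each partition loses its lease and flag only the ones that move unusually often. The sketch below is a hypothetical helper in Python, not part of EPH; `LeaseLossMonitor` and its thresholds (more than 3 losses in 10 minutes) are assumptions you would tune for your own deployment.

```python
from collections import defaultdict, deque


class LeaseLossMonitor:
    """Track lease losses per partition within a sliding time window."""

    def __init__(self, max_losses=3, window_seconds=600.0):
        self.max_losses = max_losses
        self.window_seconds = window_seconds
        self._losses = defaultdict(deque)

    def record_loss(self, partition_id, timestamp):
        """Record one lease loss; return True if this partition moves too often."""
        losses = self._losses[partition_id]
        losses.append(timestamp)
        # Forget losses that fell out of the sliding window.
        while losses and timestamp - losses[0] > self.window_seconds:
            losses.popleft()
        return len(losses) > self.max_losses


monitor = LeaseLossMonitor(max_losses=3, window_seconds=600.0)
# Partition "3" loses its lease four times within five minutes.
flags = [monitor.record_loss("3", t) for t in (0.0, 100.0, 200.0, 300.0)]
# A single loss much later is treated as ordinary rebalancing.
later = monitor.record_loss("3", 5000.0)
```

An occasional loss after a host joins or leaves stays below the threshold, while the ping-pong and poison-event cases above show up as repeated flags for the same partition.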

Basically, in the majority of situations it would be tricky for us to detect the difference between these situations and a legitimate lease loss due to balancing, so we want to delegate control of these situations to the developer.
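
For the checkpoint-on-close case from the question, one common way to use that control is to checkpoint only when the processor is closed for a clean shutdown and to skip it when the lease was lost. The Python model below illustrates the idea only; `FakePartitionContext`, `LeaseLostError`, and `close_processor` are hypothetical stand-ins, not the .NET API.

```python
from enum import Enum


class CloseReason(Enum):
    LEASE_LOST = "LeaseLost"
    SHUTDOWN = "Shutdown"


class LeaseLostError(Exception):
    """Hypothetical stand-in for LeaseLostException."""


class FakePartitionContext:
    """Toy partition context: checkpointing fails once the lease is gone."""

    def __init__(self, owns_lease):
        self.owns_lease = owns_lease
        self.checkpoints = 0

    def checkpoint(self):
        if not self.owns_lease:
            raise LeaseLostError("lease is owned by another host")
        self.checkpoints += 1


def close_processor(context, reason):
    """Checkpoint only on a clean shutdown; return a message suitable for logging."""
    if reason is CloseReason.LEASE_LOST:
        # Another host owns the partition now and resumes from the last
        # checkpoint, so writing one here can only fail.
        return "lease lost: skipping checkpoint"
    context.checkpoint()
    return "shutdown: checkpoint written"


lost = FakePartitionContext(owns_lease=False)
lost_message = close_processor(lost, CloseReason.LEASE_LOST)
clean = FakePartitionContext(owns_lease=True)
clean_message = close_processor(clean, CloseReason.SHUTDOWN)
```

With this shape, a lease lost to normal balancing produces a log line rather than a second exception thrown from the close handler.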
