What is causing Azure Event Hubs ReceiverDisconnectedException/LeaseLostException?
Question
I'm receiving events from an EventHub using EventProcessorHost and an IEventProcessor class (call it: MyEventProcessor). I scale this out to two servers by running my EPH on both servers and having them connect to the Hub using the same ConsumerGroup but unique hostNames (using the machine name).
The problem is: at random hours of the day/night, the app logs this:
Exception information:
Exception type: ReceiverDisconnectedException
Exception message: New receiver with higher epoch of '186' is created hence current receiver with epoch '186' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used.
at Microsoft.ServiceBus.Common.ExceptionDispatcher.Throw(Exception exception)
at Microsoft.ServiceBus.Common.Parallel.TaskHelpers.EndAsyncResult(IAsyncResult asyncResult)
at Microsoft.ServiceBus.Messaging.IteratorAsyncResult`1.StepCallback(IAsyncResult result)
This Exception occurs at the same time as a LeaseLostException, thrown from MyEventProcessor's CloseAsync method when it tries to checkpoint. (Presumably Close is being called because of the ReceiverDisconnectedException?)
I think this is occurring due to Event Hubs' automatic lease management when scaling out to multiple machines. But I'm wondering if I need to do something different to make it work more cleanly and avoid these Exceptions? Eg: something with epochs?
Answer
TLDR: This behavior is absolutely normal.
Why can't Lease Management be smooth & exception-free? To give the developer more control over the situation.
The really long story - all the way from the basics
EventProcessorHost (hereafter EPH - very similar to what the __consumer_offset topic does for Kafka consumers - a partition ownership & checkpoint store) is written by the Microsoft Azure EventHubs team themselves - to translate all of the EventHubs partition receiver Gu into a simple onReceive(Events) callback.

EPH is used to address 2 general, major, well-known problems while reading out of high-throughput partitioned streams like EventHubs:
- Fault-tolerant receive pipeline - for ex: a simpler version of the problem - if the host running a PartitionReceiver dies and comes back, it needs to resume processing from where it left off. To remember the last successfully processed EventData, EPH uses the blob supplied to the EPH constructor to store the checkpoints - whenever the user invokes context.CheckpointAsync(). Eventually, when the host process dies (for ex: abruptly reboots, or hits a hardware fault and never comes back), any EPH instance can pick up this task and resume from that Checkpoint.
- Balance/distribute partitions across EPH instances - let's say, if there are 10 partitions and 2 EPH instances processing events from these 10 partitions, we need a way to divide the partitions across the instances (the PartitionManager component of the EPH library does this). We use the Azure Storage Blob lease-management feature to implement this. As of version 2.2.10 - to simplify the problem - EPH assumes that all partitions are loaded equally.
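The two jobs above can be sketched with a toy model. This is purely illustrative Python - the real EPH is a .NET library that persists checkpoints and leases in Azure Storage blobs, and every name below is made up for the sketch:

```python
# Toy model of the two EPH jobs above (illustrative only; the real EPH
# stores checkpoints and leases in Azure Storage blobs).

class CheckpointStore:
    """Remembers the last successfully processed offset per partition."""

    def __init__(self):
        self._offsets = {}  # partition_id -> offset

    def checkpoint(self, partition_id, offset):
        # Stands in for context.CheckpointAsync() writing to the blob.
        self._offsets[partition_id] = offset

    def resume_from(self, partition_id):
        # Any host that later picks up the partition resumes from here.
        return self._offsets.get(partition_id)


def fair_share(partitions, hosts):
    """Round-robin split - 'fair' because EPH assumes equal partition load."""
    shares = {h: [] for h in hosts}
    for i, p in enumerate(partitions):
        shares[hosts[i % len(hosts)]].append(p)
    return shares


store = CheckpointStore()
store.checkpoint("3", 41)                 # host A checkpoints, then dies
resumed = store.resume_from("3")          # host B resumes at offset 41

shares = fair_share(list(range(10)), ["EPH1", "EPH2"])  # 5 partitions each
```

With 10 partitions and 2 hosts, each host ends up owning 5 partitions, which is exactly the fair share the lease-stealing described below converges to.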
With this, let's try to see what's going on. So, to start with, in the above example of 10 event hub partitions and 2 EPH instances processing events out of them:
- Let's say the first EPH instance - EPH1 - started, at first, alone; as part of start-up it created receivers to all 10 partitions and is processing events. During start-up, EPH1 will announce that it owns all these 10 partitions by acquiring leases on the 10 storage blobs representing these 10 event hub partitions (with a standard nomenclature, which EPH internally creates in the storage account - from the StorageConnectionString passed to the ctor). Leases are acquired for a set time, after which the EPH instance will lose its ownership of the partition. EPH1 continually announces, once in a while, that it still owns those partitions - by renewing the leases on the blobs. The frequency of renewal, along with other useful tuning, can be configured using PartitionManagerOptions.
- Now, let's say EPH2 starts up - and you supplied the same AzureStorageAccount as EPH1 to the ctor of EPH2 as well. Right now, it has 0 partitions to process. So, to achieve balance of partitions across EPH instances, it will go ahead and download the list of all lease blobs, which hold the mapping of owner to partitionId. From this, it will STEAL leases for its fair share of partitions - which is 5 in our example - and will announce that information on those lease blobs. As part of this, EPH2 reads the latest checkpoint written for each partition it wants to steal the lease for, and goes ahead and creates the corresponding PartitionReceiver with the EPOCH same as the one in the Checkpoint.
- As a result, EPH1 will lose ownership of these 5 partitions and will run into different errors based on the exact state it is in:
  - If EPH1 is actually invoking the PartitionReceiver.Receive() call while EPH2 is creating the PartitionReceiver on the same partition, EPH1 will experience a ReceiverDisconnectedException. This will eventually invoke IEventProcessor.Close(CloseReason=LeaseLost). Note that the probability of hitting this specific exception is higher if the messages being received are larger or the PrefetchCount is smaller - as in both cases the receiver would be performing more aggressive I/O.
  - If EPH1 is in the state of checkpointing the lease or renewing the lease while EPH2 stole the lease, the EventProcessorOptions.ExceptionReceived event handler would be signaled with a LeaseLostException (with a 409 conflict error on the lease blob) - which also eventually invokes IEventProcessor.Close(CloseReason=LeaseLost).
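The epoch behavior behind the ReceiverDisconnectedException can be sketched as a toy service-side rule. This is an illustrative Python sketch with made-up names, not the actual service logic (the real rules have more cases): the service keeps one epoch receiver per consumer group and partition, a receiver with a higher (or equal) epoch evicts the current one, and a lower epoch is rejected.

```python
# Toy sketch of service-side epoch fencing on one partition (illustrative;
# names and the exact rule are simplified assumptions).

class PartitionEndpoint:
    def __init__(self):
        self.active_epoch = None  # epoch of the currently connected receiver

    def open_receiver(self, epoch):
        """Returns True if an existing receiver was evicted - that receiver
        would then observe a ReceiverDisconnectedException."""
        if self.active_epoch is not None and epoch < self.active_epoch:
            raise ValueError("receiver with lower epoch is rejected")
        evicted = self.active_epoch is not None
        self.active_epoch = epoch
        return evicted


p = PartitionEndpoint()
first_evicted = p.open_receiver(186)   # EPH1 connects: nothing to evict
second_evicted = p.open_receiver(187)  # EPH2 steals with a higher epoch
```

This is why the stolen-from instance sees the "New receiver with higher epoch ... is created hence current receiver ... is getting disconnected" message from the question.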
To keep the consumer simple and error-free, lease-management-related exceptions could have been swallowed by EPH and not notified to the user code at all. However, we realized that throwing a LeaseLostException could empower customers to find interesting bugs in their IEventProcessor.ProcessEvents() callback - bugs for which the symptom would be frequent partition moves:
- A minor network outage on a specific machine - due to which EPH1 fails to renew leases and then comes back up! Imagine the n/w of this machine stays flaky for a day - the EPH instances are going to play ping-pong with the partitions! This machine will continuously try to steal leases from the other machine - which is legitimate from EPH's point of view, but a total disaster for the user of EPH, as it completely interferes with the processing pipeline. EPH1 would see exactly a ReceiverDisconnectedException when the n/w comes back up on this flaky machine! We think the best, and in fact the only, way is to enable the developer to smell this!
- Or a simple scenario like having a bug in the ProcessEvents logic - which throws unhandled exceptions that are fatal and bring down the whole process - ex: a poison event. That partition is going to move around a lot.
- Customers performing write/delete operations, by mistake, on the same storage account which EPH is also using (like an automated clean-up script), etc.
- Last but not least - which we wish never happens - say a 5-minute outage in the Azure data center where a specific EventHub.Partition is located - say an n/w incident. Partitions are going to move around across EPH instances.
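Since EPH surfaces these events instead of hiding them, one way to "smell" the ping-pong symptom is simply to count how often each partition closes with reason LeaseLost. The sketch below is hypothetical (the threshold, names, and reason strings are made up for illustration, not part of the EPH API):

```python
# Hypothetical "smell test": frequent Close(reason=LeaseLost) on the same
# partition suggests lease ping-pong (flaky network, poison event, or
# something else touching the storage account) rather than a one-off rebalance.

from collections import defaultdict

class LeaseLostMonitor:
    def __init__(self, threshold=3):
        self.threshold = threshold          # moves considered "too many"
        self.counts = defaultdict(int)      # partition_id -> LeaseLost count

    def on_close(self, partition_id, reason):
        # Call this from your processor's Close handler.
        if reason == "LeaseLost":
            self.counts[partition_id] += 1

    def suspicious_partitions(self):
        # Partitions that moved "a lot" - worth investigating.
        return sorted(p for p, n in self.counts.items() if n >= self.threshold)


mon = LeaseLostMonitor(threshold=3)
for _ in range(3):
    mon.on_close("7", "LeaseLost")   # the same partition keeps moving
mon.on_close("2", "Shutdown")        # normal shutdown - not counted
```

A real implementation would also bound the counts to a time window, since a few LeaseLost closes per day are just normal rebalancing.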
Basically, in the majority of situations it would be tricky for us to detect the difference between these situations and a legitimate lease lost due to balancing, so we want to delegate control of these situations to the developer.