Dealing with exceptions in an event driven world


Question

I'm trying to understand how exceptions are handled in an event-driven world using microservices (with Apache Kafka). For example, take the following order scenario, in which these actions need to happen before the order can be completed:

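  • 1) Authorise the payment with the payment service provider
  • 2) Reserve the item in inventory
  • 3.1) Capture the payment with the payment service provider
  • 3.2) Order the item
  • 4) Send an email notification accepting the order, with a receipt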

At any stage in this scenario, there could be a failure such as:

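  • The item is no longer in stock
  • The payment information is incorrect
  • The account used for the payment does not have the funds available
  • An external call, such as one to the payment service provider, fails, e.g. because of downtime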

How do you track that each stage has been called for and/or completed?

How do you deal with issues that arise? How would you notify the frontend of the failure?

Recommended answer

Some of the things you describe are not errors or exceptions, but alternative flows that you should account for in your distributed architecture.

For example, an item being out of stock is a perfectly valid alternative flow in your business process, and one that possibly requires human intervention. You could move the message to a separate queue and provide some UI where a human operator can deal with the problem, solve it and let the flow of events continue.

A similar thing could be said of the payment problems you describe. If an order cannot be settled successfully, a human operator will need to investigate the case and solve it. Your design must therefore treat that alternative flow as part of the process, so that a person can intervene when messages end up in a queue that requires manual review.

Those cases should be differentiated from errors or exceptions thrown by the program. Depending on the circumstances, such errors might in fact require moving the message to a dead letter queue (DLQ) for an engineer to take a look at it.
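
The distinction between an alternative flow and a program error can be sketched roughly as follows. This is only an illustration under stated assumptions: `BusinessRuleViolation` and `dispatch` are hypothetical names, and plain Python lists stand in for the review queue and the DLQ, which in a Kafka setup would more likely be separate topics.

```python
from typing import Any, Callable, Dict, List


class BusinessRuleViolation(Exception):
    """Hypothetical marker for an expected alternative flow (e.g. out of stock)."""


def dispatch(
    message: Dict[str, Any],
    handler: Callable[[Dict[str, Any]], None],
    review_queue: List[Dict[str, Any]],
    dead_letter_queue: List[Dict[str, Any]],
) -> str:
    """Process a message and route failures by kind.

    The two lists are in-memory stand-ins for real queues or topics.
    """
    try:
        handler(message)
    except BusinessRuleViolation as exc:
        # Expected alternative flow: park it for a human operator.
        review_queue.append({"message": message, "reason": str(exc)})
        return "review"
    except Exception as exc:
        # Unexpected program error: park it for an engineer.
        dead_letter_queue.append({"message": message, "error": repr(exc)})
        return "dead-letter"
    return "processed"
```

The point of the split is that the two queues have different audiences: the first feeds an operator UI, the second feeds whoever maintains the service.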

This is a very broad topic and entire books could be written about it.

I believe you would benefit from gaining more understanding of concepts like:

  • Compensating Transactions Pattern
  • Try/Cancel/Confirm Pattern
  • Long Running Transactions
  • Sagas

The idea behind compensating transactions is that every yin has its yang: if you have one transaction that can place an order, then you can undo it with a transaction that cancels that order. This latter transaction is a compensating transaction. So, if you carry out a number of successful transactions and then one of them fails, you can retrace your steps and compensate every successful transaction you did and, as a result, revert their side effects.
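
A minimal sketch of that idea, assuming each step is supplied together with its own compensating action. The `run_with_compensation` helper and the step names are hypothetical, chosen to mirror the order scenario in the question; real compensations can themselves fail and would need their own handling.

```python
from typing import Callable, List, Tuple

# (name, action, compensating action); all three are supplied by the caller.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> List[str]:
    """Run steps in order; if one fails, compensate the completed ones in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            log.append(f"{name} failed: {exc}")
            # Retrace our steps: the most recently completed step is undone first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated {done_name}")
            return log
        completed.append(step)
        log.append(f"{name} ok")
    return log


def fail_capture() -> None:
    raise RuntimeError("provider unavailable")


result = run_with_compensation([
    ("authorise_payment", lambda: None, lambda: None),
    ("reserve_stock", lambda: None, lambda: None),
    ("capture_payment", fail_capture, lambda: None),
])
```

Here the third step fails, so `result` records the two successful steps, the failure, and then the compensations for `reserve_stock` and `authorise_payment`, in that order.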

I particularly liked a chapter in the book REST: From Research to Practice. Its chapter 23 (Towards Distributed Atomic Transactions over RESTful Services) goes deep into explaining the Try/Cancel/Confirm pattern.

In general terms, it implies that when you carry out a group of transactions, their side effects do not take effect until a transaction coordinator gets confirmation that they were all successful. For example, if you make a reservation on Expedia and your flight has two legs with different airlines, then one transaction would reserve a flight with American Airlines and another would reserve a flight with United Airlines. If your second reservation fails, you want to compensate the first one. But not only that: you also want the first reservation not to take effect until you have been able to confirm both. So the initial transaction makes the reservation but keeps its side effects pending confirmation, and the second reservation does the same. Once the transaction coordinator knows everything is reserved, it can send a confirmation message to all parties so that they confirm their reservations. If reservations are not confirmed within a sensible time window, they are automatically reversed by the affected system.
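
The two-leg booking can be sketched along these lines. Everything here is an assumption made for illustration: `ReservationService` is a hypothetical in-memory participant, `book_trip` plays the coordinator, and the automatic reversal of unconfirmed reservations after a time window is left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReservationService:
    """Hypothetical participant that keeps reservations pending until confirmed."""
    name: str
    available: bool = True
    state: Dict[str, str] = field(default_factory=dict)

    def try_reserve(self, booking_id: str) -> bool:
        if not self.available:
            return False
        self.state[booking_id] = "PENDING"  # side effect applied, not yet effective
        return True

    def confirm(self, booking_id: str) -> None:
        self.state[booking_id] = "CONFIRMED"

    def cancel(self, booking_id: str) -> None:
        # Cancelling a booking this service never saw is a no-op.
        if booking_id in self.state:
            self.state[booking_id] = "CANCELLED"


def book_trip(booking_id: str, services: List[ReservationService]) -> bool:
    """Coordinator: try every leg, then confirm all of them or cancel all of them."""
    tried: List[ReservationService] = []
    for service in services:
        if not service.try_reserve(booking_id):
            for earlier in tried:
                earlier.cancel(booking_id)
            return False
        tried.append(service)
    for service in services:
        service.confirm(booking_id)
    return True
```

No leg becomes `CONFIRMED` unless every leg was first reserved, which is the property the pattern is after.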

The book Enterprise Integration Patterns has some basic ideas on how to implement this kind of event coordination (e.g. see the process manager pattern and compare it with the routing slip pattern, which are similar ideas to orchestration vs choreography in the microservices world).

As you can see, being able to compensate transactions might be complicated, depending on how complex your distributed workflow is. The process manager may need to keep track of the state of every step and know when the whole thing needs to be undone. This is pretty much the idea of Sagas in the microservices world.

The book Microservices Patterns has an entire chapter called Managing Transactions with Sagas that goes into detail on how to implement this type of solution.

A few other aspects I also typically consider are the following:

Idempotence

I believe that a key to successfully implementing your service transactions in a distributed system is making them idempotent. Once you can guarantee that a given service is idempotent, you can safely retry it without worrying about causing additional side effects. However, just retrying a failed transaction won't solve all your problems.
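
One common way to get idempotence is to remember which message IDs have already been applied, as in the sketch below. The `InventoryService` class and its `message_id` parameter are hypothetical; in a real service the set of processed IDs would have to be persisted, and updated in the same database transaction as the stock change, for the guarantee to hold.

```python
from typing import Dict, Set


class InventoryService:
    """Hypothetical service that applies each reservation message at most once."""

    def __init__(self, stock: Dict[str, int]) -> None:
        self.stock = dict(stock)
        self._processed: Set[str] = set()  # in-memory stand-in for a durable store

    def reserve(self, message_id: str, item: str, quantity: int) -> bool:
        """Return True if the reservation was applied, False if it was a duplicate."""
        if message_id in self._processed:
            return False  # redelivery or retry: no additional side effect
        available = self.stock.get(item, 0)
        if available < quantity:
            raise ValueError(f"insufficient stock for {item}")
        self.stock[item] = available - quantity
        self._processed.add(message_id)
        return True
```

Delivering the same message twice then decrements the stock only once, so a retry after an ambiguous failure is safe.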

Transient errors vs persistent errors

When it comes to retrying a service transaction, you shouldn't retry just because it failed. You must first know why it failed; depending on the error, it may or may not make sense to retry. Some types of errors are transient: for example, if a transaction fails due to a query timeout, it is probably fine to retry and it will most likely succeed the second time. But if you get a database constraint violation error (e.g. because a DBA added a check constraint to a field), then there is no point in retrying that transaction: no matter how many times you try, it will fail.
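
A small sketch of a retry helper that makes this distinction explicit. `TransientError`, `PermanentError` and `call_with_retry` are hypothetical names; in practice you would map your driver's or client's exceptions (timeouts, constraint violations) onto these two categories yourself.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical category: may succeed if retried (e.g. a query timeout)."""


class PermanentError(Exception):
    """Hypothetical category: will fail every time (e.g. a constraint violation)."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
) -> T:
    """Retry only transient failures, with exponential backoff between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the last transient error
            time.sleep(base_delay * 2 ** (attempt - 1))
        # PermanentError, like any other exception, is not caught here and
        # propagates immediately without a retry.
    raise ValueError("max_attempts must be at least 1")
```

Retrying like this is only safe if the operation is idempotent, which is why the two concerns tend to be designed together.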

Embrace the error as an alternative flow

As mentioned at the beginning of my answer, not everything is an error. Some things are just alternative flows.

In those cases of interservice communication (computer-to-computer interactions), when a given step of your workflow fails, you don't necessarily need to undo everything you did in previous steps. You can just embrace the error as part of your workflow. Catalog the possible causes of error and make them an alternative flow of events that simply requires human intervention. It is just another step in the full orchestration, one that requires a person to intervene to make a decision, resolve an inconsistency in the data or just approve which way to go.

For example, maybe when you're processing an order, the payment service fails because the customer doesn't have enough funds. In that case there is no point in undoing everything else. All you need is to put the order in a state in which some problem solver can address it in the system; once it is fixed, you can continue with the rest of the workflow.

Transaction and data model state are key

I have found that this type of transactional workflow requires a good design of the different states your model has to go through. As with the Try/Cancel/Confirm pattern, this implies initially applying the side effects without necessarily making the data model available to the users.

For example, when you place an order, maybe you add it to the database in a "Pending" status that will not appear in the UI of the warehouse systems. Once payment has been confirmed, the order will appear in the UI so that a user can finally process its shipment.
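
The order states can be modeled as a small state machine, sketched below. The status names other than "Pending", the `ALLOWED` transition table and the helper functions are all assumptions for illustration; the `NEEDS_ATTENTION` state is where an order would wait for an operator, for instance after a payment failure.

```python
from enum import Enum
from typing import Dict, List, Set


class OrderStatus(Enum):
    PENDING = "pending"                  # side effects applied, hidden from the warehouse
    NEEDS_ATTENTION = "needs_attention"  # parked until a person resolves the problem
    CONFIRMED = "confirmed"
    CANCELLED = "cancelled"


# Hypothetical transition table: which states may follow which.
ALLOWED: Dict[OrderStatus, Set[OrderStatus]] = {
    OrderStatus.PENDING: {
        OrderStatus.CONFIRMED,
        OrderStatus.NEEDS_ATTENTION,
        OrderStatus.CANCELLED,
    },
    OrderStatus.NEEDS_ATTENTION: {OrderStatus.PENDING, OrderStatus.CANCELLED},
    OrderStatus.CONFIRMED: set(),
    OrderStatus.CANCELLED: set(),
}


def transition(current: OrderStatus, target: OrderStatus) -> OrderStatus:
    """Return the new status, rejecting transitions the model does not allow."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def visible_to_warehouse(orders: Dict[str, OrderStatus]) -> List[str]:
    """Only confirmed orders appear in the warehouse UI."""
    return sorted(
        order_id
        for order_id, status in orders.items()
        if status is OrderStatus.CONFIRMED
    )
```

An order parked in `NEEDS_ATTENTION` can go back to `PENDING` once the operator has fixed the problem, and the workflow resumes from there instead of starting over.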

The difficulty here is discovering how to design transaction granularity in such a way that, even if one step of your transactional workflow fails, the system remains in a valid state from which you can resume once the cause of the failure is corrected.

Designing distributed transactional workflows

So, as you can see, designing a distributed system that works in this way is a bit more complicated than individually invoking distributed transactional services. Every service invocation may now fail for a number of reasons and leave your distributed workflow in an inconsistent state, and retrying the transaction may not always solve the problem. Your data also needs to be modeled like a state machine, such that side effects are applied but not confirmed until the entire orchestration is successful.

That's why the whole thing may need to be designed differently from how you would typically design a monolithic client-server application. Your users may now be part of the designed solution when it comes to resolving conflicts, and you should expect that transactional orchestrations could take hours or even days to complete, depending on how their conflicts are resolved.

As I was originally saying, the topic is way too broad, and it would require a more specific question to discuss, perhaps, just one or two of these aspects in detail.

At any rate, I hope this helps you with your investigation.
