Amazon Kinesis & AWS Lambda Retries


Problem description


I'm very new to Amazon Kinesis, so maybe this is just a gap in my understanding, but the AWS Lambda FAQ says:

The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.

My question is, what happens if for some reason some malformed data gets put onto a shard by a producer and when the Lambda function picks it up it errors out and then just keeps retrying constantly? This then means that the processing of that particular shard would be blocked for 24 hours by the error.

Is the best practice to handle application errors like that by wrapping the problem in a custom error and sending this error downstream along with all the successfully processed records and let the consumer handle it? Of course, this still wouldn't help in the case of an unrecoverable error that crashed the program like a null pointer: again we'd be back to the blocking retry loop for the next 24 hours.

Solution

Don't overthink it: Kinesis behaves like a queue. You have to consume a record successfully (i.e. pop it from the queue) before you can proceed to the next one, in first-in, first-out order.

The appropriate approach should be:

  • Get a record from the stream.
  • Process it in a try-catch-finally block.
  • If the record is processed successfully, no problem. <- TRY
  • But if it fails, write it down somewhere else so you can investigate why it failed. <- CATCH
  • And at the end of your logic block, always persist the position (the checkpoint) to DynamoDB. <- FINALLY
  • If an internal error occurs in your system (memory error, hardware error, etc.), that is another story, as it may affect the processing of all records, not just one.
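The steps above can be sketched as a Python Lambda handler. This is only an illustration under stated assumptions: `process_record` and the in-memory `FAILED_RECORDS` list are hypothetical stand-ins for your business logic and for whatever side store you use for failed records, and the payload is assumed to be JSON. The event layout (`Records`, `kinesis.data` as base64, `kinesis.sequenceNumber`) is the one Lambda uses for Kinesis batches. With a Lambda consumer the stream position is advanced by the service once the handler returns without raising, so the sketch has no explicit checkpoint call; a consumer that manages its own position would persist it in the `finally` step.

```python
import base64
import json
import logging

logger = logging.getLogger(__name__)

# Hypothetical side store for failed records; in practice this might be
# a queue, a bucket, or a table that someone inspects later.
FAILED_RECORDS = []


def process_record(payload):
    """Hypothetical business logic: raises on records it cannot handle."""
    if "id" not in payload:
        raise ValueError("missing 'id' field")


def handler(event, context=None):
    """Process each record; set failures aside instead of raising."""
    failures = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        try:
            # TRY: decode and process the record.
            payload = json.loads(base64.b64decode(kinesis["data"]))
            process_record(payload)
        except Exception as exc:
            # CATCH: note the failure elsewhere for later investigation.
            logger.warning("record %s failed: %s", kinesis["sequenceNumber"], exc)
            failures.append(
                {
                    "sequenceNumber": kinesis["sequenceNumber"],
                    "error": str(exc),
                    "data": kinesis["data"],
                }
            )
    FAILED_RECORDS.extend(failures)
    # Returning normally (no exception) means the batch is not retried,
    # so one bad record cannot block the shard.
    return {
        "processed": len(event["Records"]) - len(failures),
        "failed": len(failures),
    }
```

The trade-off is that a failed record is skipped rather than retried, so the side store has to be durable enough that nothing is silently lost.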

By the way, if processing a single record takes more than 1 minute, you are clearly doing something wrong. Kinesis is designed to handle thousands of records per second, so you cannot afford such long-running jobs for each record.

The question you are asking is a general problem of queue systems, sometimes called the "poison message" problem. You have to handle such messages in your business logic to be safe.
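One common way to handle poison messages in business logic is to retry a bounded number of times and then quarantine the message so the rest of the queue can proceed. The sketch below is generic Python, not a Kinesis or Lambda API; `consume`, `max_attempts`, and the returned quarantine list are hypothetical names chosen for illustration.

```python
def consume(messages, process, max_attempts=3):
    """Process messages in order; quarantine any that keep failing.

    Returns a list of (message, error_text) pairs for quarantined messages.
    """
    quarantined = []
    for message in messages:
        last_error = None
        for _ in range(max_attempts):
            try:
                process(message)
                break  # success: move on to the next message
            except Exception as exc:
                # Remember the error; a transient fault may pass on retry.
                last_error = exc
        else:
            # Every attempt failed: treat it as a poison message and
            # set it aside instead of blocking the messages behind it.
            quarantined.append((message, str(last_error)))
    return quarantined
```

Retrying a few times before quarantining gives transient faults (a timeout, a throttled dependency) a chance to clear, while a message that is malformed will fail every attempt and be moved out of the way.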

http://www.cogin.com/articles/SurvivingPoisonMessages.php#PoisonMessages
