Avoiding data loss when slow consumers force backpressure in stream processing (Spark, AWS)

Question

I'm new to distributed stream processing (Spark). I've read some tutorials and examples that cover how backpressure causes the producer(s) to slow down in response to overloaded consumers. The classic example given is ingesting and analyzing tweets: when an unexpected spike in traffic leaves the consumers unable to handle the load, they apply backpressure and the producer responds by lowering its rate.

What I don't really see covered is which approaches are used in practice to deal with the massive amount of incoming real-time data that cannot be processed immediately because the pipeline as a whole has lower capacity.

I imagine the answer depends on the business domain. For some problems it might be fine to just drop that data, but in this question I would like to focus on a case where we don't want to lose any data.

Since I will be working in an AWS environment, my first thought would be to "buffer" the excess data in an SQS queue or a Kinesis stream. Is it as simple as this in practice, or is there a more standard streaming solution to this problem (perhaps as part of Spark itself)?

Recommended answer

"Is there a more standard streaming solution?" - Maybe. There are a lot of different ways to do this, and it's not immediately clear whether there is a "standard" yet. This is just an opinion, though, and you're not likely to get a concrete answer for this part.

"Is it as simple as this in practice?" - SQS and Kinesis have different usage patterns:

  • Use SQS if you want to always process all messages AND have a single logical consumer
    • Think of this like a classic queue where messages need to be "consumed" from the queue.
    • It is definitely a simpler model to understand and get going with, but it essentially acts as a buffer.
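
To make the "queue as a buffer" idea concrete, here is a minimal in-memory sketch. It is not the SQS API: `BufferQueue`, `run_simulation`, and the rates used are hypothetical names and numbers chosen for illustration. It only shows that when a producer outpaces a consumer, a queue that holds messages until they are explicitly deleted lets the backlog grow instead of dropping data.

```python
from collections import deque


class BufferQueue:
    """In-memory stand-in for a durable queue (hypothetical, not the SQS API)."""

    def __init__(self):
        self._pending = deque()  # messages waiting to be received
        self._in_flight = {}     # received but not yet deleted (acknowledged)
        self._next_id = 0

    def send(self, body):
        self._pending.append((self._next_id, body))
        self._next_id += 1

    def receive(self, max_messages):
        """Hand out up to max_messages; they stay tracked until deleted."""
        batch = []
        while self._pending and len(batch) < max_messages:
            msg_id, body = self._pending.popleft()
            self._in_flight[msg_id] = body
            batch.append((msg_id, body))
        return batch

    def delete(self, msg_id):
        """Acknowledge a message only after it has been processed."""
        self._in_flight.pop(msg_id, None)

    def backlog(self):
        return len(self._pending) + len(self._in_flight)


def run_simulation(ticks=10, produce_per_tick=5, consume_per_tick=2):
    """Producer is faster than the consumer; the queue absorbs the difference."""
    queue = BufferQueue()
    processed = []
    produced = 0
    for _ in range(ticks):
        for _ in range(produce_per_tick):
            queue.send(produced)
            produced += 1
        for msg_id, body in queue.receive(consume_per_tick):
            processed.append(body)  # "process" the message
            queue.delete(msg_id)    # then acknowledge it
    backlog_after_spike = queue.backlog()
    # The spike is over: the consumer keeps draining at its own pace.
    while queue.backlog() > 0:
        for msg_id, body in queue.receive(consume_per_tick):
            processed.append(body)
            queue.delete(msg_id)
    return produced, processed, backlog_after_spike


if __name__ == "__main__":
    produced, processed, backlog = run_simulation()
    print(f"produced={produced} backlog_after_spike={backlog} processed={len(processed)}")
```

The trade-off this sketch makes visible is latency for completeness: nothing is lost, but messages wait in the backlog for as long as the consumer needs to catch up.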

For your use case, where you have a "massive amount of incoming real-time data which cannot be immediately processed", I'd focus your efforts on Kinesis over SQS, as the Kinesis model also aligns better with other streaming mechanisms like Spark / Kafka.
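
As a rough illustration of why the stream model differs from a classic queue, the sketch below treats the stream as a retained, append-only log where each consumer tracks its own read position. This is an in-memory toy, not the Kinesis API: `RetainedLog`, `CheckpointedConsumer`, and `demo` are hypothetical names, and a list index stands in for a sequence number.

```python
class RetainedLog:
    """Append-only log standing in for one stream shard (hypothetical, not the Kinesis API)."""

    def __init__(self):
        self._records = []

    def put(self, data):
        self._records.append(data)
        return len(self._records) - 1  # position, playing the role of a sequence number

    def read(self, start, limit):
        # Reading does not remove anything; records stay available to other readers.
        return self._records[start:start + limit]


class CheckpointedConsumer:
    """Reads at its own pace and remembers how far it has got."""

    def __init__(self, log, batch_size):
        self.log = log
        self.batch_size = batch_size
        self.checkpoint = 0  # next position to read
        self.seen = []

    def poll(self):
        batch = self.log.read(self.checkpoint, self.batch_size)
        self.seen.extend(batch)
        self.checkpoint += len(batch)
        return len(batch)


def demo():
    log = RetainedLog()
    for i in range(12):
        log.put(i)
    fast = CheckpointedConsumer(log, batch_size=6)
    slow = CheckpointedConsumer(log, batch_size=2)
    fast.poll()
    fast.poll()  # the fast consumer has caught up
    slow.poll()  # the slow consumer is far behind, but nothing is dropped
    return fast, slow


if __name__ == "__main__":
    fast, slow = demo()
    print("fast checkpoint:", fast.checkpoint, "slow checkpoint:", slow.checkpoint)
```

One caveat the toy ignores: in Kinesis, records are only retained for a limited, configurable period, so a consumer that falls behind by more than the retention window would still lose data. The retention setting has to be sized against the worst expected lag.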
