How to manually commit offset in Spark Kafka direct streaming?


Problem description

I looked around hard but didn't find a satisfactory answer to this. Maybe I'm missing something. Please help.

We have a Spark streaming application consuming a Kafka topic, which needs to ensure end-to-end processing before advancing Kafka offsets, e.g. updating a database. This is much like building transaction support within the streaming system, and guaranteeing that each message is processed (transformed) and, more importantly, output.

I have read about Kafka DirectStreams. It says that for robust failure recovery in DirectStreaming mode, Spark checkpointing should be enabled, which stores the offsets along with the checkpoints. But the offset management is done internally (by setting Kafka config params such as "auto.offset.reset", "auto.commit.enable", and "auto.commit.interval.ms"). It does not say how (or whether) we can customize committing offsets (e.g. once we've finished loading the database). In other words, can we set "auto.commit.enable" to false and manage the offsets ourselves (much as we would a DB connection)?
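The "manage the offsets ourselves" idea can be sketched independently of Spark or the Kafka client. In the minimal sketch below (all names are hypothetical; `sink` stands in for the database write), the per-partition "committed" offset is advanced only after the side effect has succeeded, which is exactly what auto-commit cannot guarantee:

```python
# Minimal sketch of manual offset bookkeeping (hypothetical types, no Kafka
# client): advance the per-partition "committed" offset only after the side
# effect (standing in for the database write) has succeeded.

def process_and_commit(records, sink):
    """records: iterable of (partition, offset, value) tuples, in order.
    sink: callable returning True once the value is durably written.
    Returns {partition: next offset to resume from after a restart}."""
    committed = {}
    for partition, offset, value in records:
        if sink(value):  # commit only after the output is durably written
            committed[partition] = offset + 1
    return committed
```

If the sink fails partway through a batch, the offsets for the failed records are never advanced, so those records are re-read on restart (at-least-once semantics); making it exactly-once requires the output store to deduplicate or to hold the offsets transactionally, as discussed below.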

Any guidance/help is greatly appreciated.

Recommended answer

The article below could be a good start to understand the approach.

spark-kafka achieving zero-data-loss

More

The article suggests using a ZooKeeper client directly, which could also be replaced by something like KafkaSimpleConsumer. The advantage of using ZooKeeper/KafkaSimpleConsumer is compatibility with monitoring tools that rely on offsets saved in ZooKeeper. The information can also be saved on HDFS or any other reliable storage.
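One concrete variant of "saved on any other reliable storage" is to write the offsets in the same database transaction as the results, so that results and offsets become visible atomically and a restart resumes exactly where the last successful batch ended. A minimal sketch using SQLite (the table names, schema, and `(topic, partition, until_offset)` tuples are illustrative assumptions, not part of any Spark/Kafka API):

```python
import sqlite3

def init_schema(conn):
    """Create the (hypothetical) results and offsets tables."""
    conn.execute("CREATE TABLE IF NOT EXISTS results(value TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS kafka_offsets("
                 "topic TEXT, part INTEGER, until_offset INTEGER, "
                 "PRIMARY KEY(topic, part))")

def save_batch(conn, results, offset_ranges):
    """Write a batch of results AND its Kafka offsets in one transaction.
    results: list of output values; offset_ranges: (topic, partition,
    until_offset) tuples. On any failure the transaction rolls back, so
    neither results nor offsets advance and the batch is reprocessed."""
    with conn:  # sqlite3 connection context manager: commit or roll back
        conn.executemany("INSERT INTO results(value) VALUES (?)",
                         [(r,) for r in results])
        conn.executemany(
            "INSERT OR REPLACE INTO kafka_offsets(topic, part, until_offset) "
            "VALUES (?, ?, ?)", offset_ranges)

def resume_offsets(conn):
    """Read back the last committed offsets, e.g. to seed the direct
    stream's starting positions after a restart."""
    rows = conn.execute(
        "SELECT topic, part, until_offset FROM kafka_offsets").fetchall()
    return {(t, p): o for t, p, o in rows}
```

The same pattern applies to any transactional store; with a non-transactional one such as HDFS, writing the offsets file only after the batch output succeeds gives the weaker at-least-once guarantee instead.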
