Spark Structured Streaming Kafka Integration Offset management
Problem Description
The documentation says:
enable.auto.commit: Kafka source doesn't commit any offset.
Hence my question is, in the event of a worker or partition crash/restart:
- With startingOffsets set to latest, how do we not lose messages?
- With startingOffsets set to earliest, how do we not reprocess all messages?
This seems to be quite important. Any indication on how to deal with it?
Recommended Answer
I ran into this issue as well.
You're right in your observations on the two options, i.e.:
- potential data loss if startingOffsets is set to latest
- duplicate data if startingOffsets is set to earliest
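For concreteness, here is a minimal sketch (Scala) of where startingOffsets is configured on the Kafka source; the broker address and topic name are placeholders, not values from the question:

import org.apache.spark.sql.SparkSession

object StartingOffsetsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("starting-offsets-sketch")
      .getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("subscribe", "my-topic")                  // placeholder topic
      // "latest"   -> only records arriving after the query starts (possible loss)
      // "earliest" -> everything still retained in the topic (possible duplicates)
      .option("startingOffsets", "latest")
      .load()

    df.printSchema() // key, value, topic, partition, offset, timestamp, ...
  }
}

Note that startingOffsets only applies when a query starts without an existing checkpoint; a resumed query always picks up from where it left off.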
However...
There is the option of checkpointing by adding the following option:
.writeStream.<something>.option("checkpointLocation", "path/to/HDFS/dir").<something else>
In the event of a failure, Spark would go through the contents of this checkpoint directory and recover the state before accepting any new data.
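Putting it together, a minimal end-to-end sketch might look like the following; the broker address, topic names, and checkpoint path are all hypothetical:

import org.apache.spark.sql.SparkSession

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("checkpoint-sketch")
      .getOrCreate()

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("subscribe", "in-topic")                  // placeholder topic
      .load()

    val query = input
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "out-topic")                     // placeholder topic
      // Progress (offsets and state) is persisted here; on restart Spark
      // resumes from the last committed batch instead of consulting
      // startingOffsets.
      .option("checkpointLocation", "path/to/HDFS/dir")
      .start()

    query.awaitTermination()
  }
}

On restart with the same checkpointLocation, the query replays only the batches that were not fully committed, which is what gives at-least-once delivery (and, with an idempotent sink, effectively exactly-once behavior).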
I found this useful reference on the same.
Hope this helps!