Batch write from Databricks to Kafka does not observe checkpoints and writes duplicates


Problem description


Follow-up from my previous question: I'm writing a large dataframe in a batch from Databricks to Kafka. This generally works fine now. However, sometimes there are errors (mostly timeouts). Retrying kicks in and processing starts over again. But this does not seem to observe the checkpoint, which results in duplicates being written to the Kafka sink.


So should checkpoints work in batch-writing mode at all? Or am I missing something?

Configuration:

EH_SASL = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=****";'

dfKafka \
    .write \
    .format("kafka") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
    .option("topic", "mytopic") \
    .option("checkpointLocation", "/mnt/telemetry/cp.txt") \
    .save()

Recommended answer


Spark checkpoints tend to cause duplicates. Storing and reading offsets from Zookeeper may solve this issue. Here is the link for details:

http://aseigneurin.github.io/2016/05/07/spark-kafka-achieving-zero-data-loss.html


Also, in your case, are checkpoints not working at all, or are they causing duplicates? The URL above helps with the latter case.

