Batch write from Databricks to Kafka does not observe checkpoints and writes duplicates

Problem description

Follow-up from my previous question: I'm writing a large dataframe in a batch from Databricks to Kafka. This generally works fine now. However, sometimes there are errors (mostly timeouts). Retrying kicks in and processing starts over again, but this does not seem to observe the checkpoint, which results in duplicates being written to the Kafka sink.

So should checkpoints work in batch-writing mode at all? Or am I missing something?

Configuration:

# JAAS configuration for the Event Hubs Kafka-compatible endpoint (access key redacted)
EH_SASL = 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://myeventhub.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=****";'

# Batch write of the dataframe to the Kafka endpoint
dfKafka \
    .write \
    .format("kafka") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
    .option("topic", "mytopic") \
    .option("checkpointLocation", "/mnt/telemetry/cp.txt") \
    .save()
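
For reference, checkpointLocation is the option Structured Streaming queries use to track progress, so a rough sketch of what the same write would look like as a one-shot streaming query, where that checkpoint directory is actually consulted, is shown below. The Delta source format, the source path and the to_json payload are assumptions for illustration only, not part of the original job.

from pyspark.sql import functions as F

# Read the source incrementally instead of as a static dataframe
# (Delta source format and path are assumed for illustration)
dfStream = spark.readStream \
    .format("delta") \
    .load("/mnt/telemetry/source")

# One-shot streaming write to Kafka: progress is recorded under
# checkpointLocation, so a rerun resumes from the last committed batch
# instead of rewriting everything
query = dfStream \
    .select(F.to_json(F.struct(*dfStream.columns)).alias("value")) \
    .writeStream \
    .format("kafka") \
    .option("kafka.sasl.mechanism", "PLAIN") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.jaas.config", EH_SASL) \
    .option("kafka.bootstrap.servers", "myeventhub.servicebus.windows.net:9093") \
    .option("topic", "mytopic") \
    .option("checkpointLocation", "/mnt/telemetry/cp.txt") \
    .trigger(once=True) \
    .start()

query.awaitTermination()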


Recommended answer

Spark checkpoints tend to cause duplicates. Storing and reading offsets from Zookeeper may solve this issue. Here is the link for details:

http://aseigneurin.github.io/2016/05/07/spark-kafka-achieving-zero-data-loss.html
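
The idea behind that suggestion is to persist the last processed offset in a Zookeeper node and read it back before (re)processing, so that a retried run can skip what was already handled. A minimal sketch of just the storing/reading part, using the kazoo client (the client choice, the Zookeeper host and the node path are assumptions), could look like this:

from kazoo.client import KazooClient

ZK_HOSTS = "zookeeper:2181"              # assumed Zookeeper address
OFFSET_PATH = "/myapp/offsets/mytopic"   # hypothetical node for this topic

zk = KazooClient(hosts=ZK_HOSTS)
zk.start()
zk.ensure_path(OFFSET_PATH)

def read_offset(default=0):
    # Last committed offset, or the default if nothing has been stored yet
    data, _stat = zk.get(OFFSET_PATH)
    return int(data) if data else default

def save_offset(offset):
    # Persist the offset only after the corresponding records were written successfully
    zk.set(OFFSET_PATH, str(offset).encode("utf-8"))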

Also, in your case, are checkpoints not working at all, or are checkpoints causing duplicates? The URL above helps with the latter case.
