How to ensure no data loss for Kafka data ingestion through Spark Structured Streaming?


Problem description

I have a long-running Spark Structured Streaming job that is ingesting Kafka data. My concern is the following: if the job fails for some reason and is restarted later, how do I ensure that Kafka data is ingested from the point where the job stopped, rather than always starting from the current and later data when the job restarts? Do I need to explicitly specify something like a consumer group and auto.offset.reset? Are they supported in Spark's Kafka ingestion? Thanks!

Solution

According to the Spark Structured Streaming + Kafka Integration Guide, Spark itself keeps track of the offsets; no offsets are committed back to Kafka. That means if your Spark Streaming job fails and you restart it, all the necessary offset information is stored in Spark's checkpoint files. That way your application knows where it left off and continues processing the remaining data.
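For illustration, here is a minimal Scala sketch of such a query; the broker address, topic name and paths are placeholders, not values from the original question. The checkpointLocation is what lets a restarted job resume from the stored offsets.

// Minimal sketch: a Kafka-to-Parquet streaming query that resumes from its
// checkpoint after a restart. Broker address, topic name and paths are
// placeholders -- replace them with your own.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-ingestion")
  .getOrCreate()

// Spark itself tracks which Kafka offsets it has consumed.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// The checkpointLocation allows a restarted job to pick up where it left
// off; it should live on reliable storage (e.g. HDFS or S3).
val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("path", "/data/kafka-sink")
  .option("checkpointLocation", "/data/checkpoints/kafka-ingestion")
  .start()

query.awaitTermination()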

I have written more details about setting group.id and Spark's checkpointing of offsets in another post.

Here are the most important Kafka-specific configurations for your Spark Structured Streaming jobs:

group.id: The Kafka source automatically creates a unique group id for each query. According to the code, the group.id is automatically set to

val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"

auto.offset.reset: Set the source option startingOffsets to specify where to start instead (see the sketch after this list). Structured Streaming manages internally which offsets are consumed, rather than relying on the Kafka consumer to do it.

enable.auto.commit: The Kafka source does not commit any offsets.

Therefore, in Structured Streaming it is currently not possible to define your own group.id for the Kafka consumer; Structured Streaming manages the offsets internally and does not commit them back to Kafka (nor does it auto-commit).
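To make this concrete, here is a minimal sketch of the Kafka source options discussed above (again with a placeholder broker address and topic name): the only offset-related knob you set is startingOffsets, while group.id, auto.offset.reset and enable.auto.commit are left to Structured Streaming.

// Minimal sketch of the Kafka source options discussed above; the broker
// address and topic name are placeholders.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  // Only consulted when no checkpoint exists yet; on restart the offsets
  // stored in the checkpoint take precedence.
  .option("startingOffsets", "earliest")
  // group.id, auto.offset.reset and enable.auto.commit are not set here:
  // the Kafka source manages them internally, as described above.
  .load()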


