How to ensure no data loss for Kafka data ingestion through Spark Structured Streaming?


Problem description

I have a long-running Spark Structured Streaming job that is ingesting Kafka data. My concern is the following: if the job fails for some reason and is restarted later, how do I make sure Kafka data is ingested from the point where it broke off, rather than the restarted job only picking up current and later data? Do I need to explicitly specify something like a consumer group and auto.offset.reset? Are these supported in Spark's Kafka ingestion? Thanks!

Recommended answer

According to the Spark Structured Streaming + Kafka Integration Guide, Spark itself keeps track of the offsets and no offsets are committed back to Kafka. That means if your Spark Streaming job fails and you restart it, all the necessary offset information is stored in Spark's checkpoint files. That way your application knows where it left off and continues to process the remaining data.

Here are the most important Kafka-specific configurations for your Spark Structured Streaming jobs:

group.id: the Kafka source automatically creates a unique group id for each query. According to the code, group.id is automatically set to

val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"

auto.offset.reset: set the source option startingOffsets instead to specify where to start. Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it (see the sketch after this list).

enable.auto.commit: the Kafka source does not commit any offsets.
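As a sketch of how startingOffsets replaces auto.offset.reset, the two reads below reuse the spark session from the example above; the topic, broker and offset numbers are made up. The option only takes effect when a query starts with an empty checkpoint; on restart the checkpointed offsets win.

// Start from the earliest available offsets ("latest" is the default for streaming queries).
val fromEarliest = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "earliest")
  .load()

// Start from explicit per-partition offsets given as JSON: -2 means earliest, -1 means latest.
val fromSpecificOffsets = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("startingOffsets", """{"events":{"0":1500,"1":-2}}""")
  .load()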

Therefore, in Structured Streaming it is currently not possible to define a custom group.id for the Kafka consumer; Structured Streaming manages the offsets internally and does not commit them back to Kafka (not even automatically).
