结构化流式Kafka源偏移存储 [英] Structured Streaming Kafka Source Offset Storage

查看:108
本文介绍了结构化流式Kafka源偏移存储的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用Kafka的结构化流源 (集成指南),如上所述,它不会产生任何偏移.

I am using the Structured Streaming source for Kafka (Integration guide), which as stated does not commit any offset.

我的目标之一是监视它(检查它是否滞后等).即使不提交偏移量,它也会通过不时查询kafka并检查下一个要处理的偏移量来处理它们.根据文档,偏移量已写入HDFS,因此在发生故障的情况下可以将其恢复,但问题是:

One of my goals is to monitor it (check if its lagging behind etc). Even though it does not commit the offsets it handles them by querying kafka from time to time and checking which is the next one to process. According to the documentation the offsets are written to HDFS so in case of failure it can be recovered, but the question is:

它们存储在哪里? 如果没有提交偏移量(结构化的),是否有任何方法可以监视火花累积(结构化)来监视卡夫卡消耗(从程序外部;因此,卡夫卡cli或类似记录,每个记录附带的偏移量不适合用例) ?

Where are they being stored? Is there any way of monitoring the kafka consumption (from outside of the program; so a kafka cli or similar, the offset coming with each record does not suit the use case) of a spark streaing (structured) if it does not commit the offsets?

欢呼

推荐答案

kafka的结构化流可将HDFS的偏移量保存在结构下方.

Structured Streaming for kafka saves offsets to HDFS below structures.

以下示例checkpointLocation设置.

Example checkpointLocation setting is below.

.writeStream.
.....
  option("checkpointLocation", "/tmp/checkPoint")
.....

在这种情况下,kafka的结构化流保存在路径下方

In that case, Structured Streaming for kafka saves below path

/tmp/checkPoint/offsets/$'batchid'

保存的文件包含以下格式.

Saved file contains below format.

v1
{"batchWatermarkMs":0,"batchTimestampMs":$'timestamp',"conf":{"spark.sql.shuffle.partitions":"200"}}
{"Topic1WithPartiton1":{"0":$'OffsetforTopic1ForPartition0'},"Topic2WithPartiton2":{"1":$'OffsetforTopic2ForPartition1',"0":$'OffsetforTopic2ForPartition1'}}

例如.

v1
{"batchWatermarkMs":0,"batchTimestampMs":1505718000115,"conf":{"spark.sql.shuffle.partitions":"200"}}
{"Topic1WithPartiton1":{"0":21482917},"Topic2WithPartiton2":{"1":103557997,"0":103547910}}

因此,我认为要监视偏移滞后,需要开发具有以下功能的自定义工具.

So, I think for monitoring offset lag, it needs to develop custom tools what has below functions.

  • 从HDFS的偏移量中读取.
  • 将偏移量写入Kafka __offset主题.

这样,现有的偏移滞后监视工具就可以监视结构化流中的卡夫卡偏移滞后.

That way, already existing offset lag monitoring tool can monitor Structured Streaming for kafka's offset lag.

这篇关于结构化流式Kafka源偏移存储的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆