结构化流式 Kafka 源偏移存储 [英] Structured Streaming Kafka Source Offset Storage

查看:35
本文介绍了结构化流式 Kafka 源偏移存储的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用 Kafka 的结构化流源(集成指南),如前所述,它没有提交任何偏移量.

I am using the Structured Streaming source for Kafka (Integration guide), which as stated does not commit any offset.

我的目标之一是监控它(检查它是否落后等).即使它没有提交偏移量,它也会通过不时查询 kafka 并检查下一个要处理的偏移量来处理它们.根据文档,偏移量被写入 HDFS,因此在发生故障时可以恢复,但问题是:

One of my goals is to monitor it (check if its lagging behind etc). Even though it does not commit the offsets it handles them by querying kafka from time to time and checking which is the next one to process. According to the documentation the offsets are written to HDFS so in case of failure it can be recovered, but the question is:

它们存放在哪里?如果不提交偏移量,是否有任何方法可以监视火花流(结构化)的 kafka 消耗(从程序外部;所以 kafka cli 或类似的,每个记录附带的偏移量不适合用例)?

Where are they being stored? Is there any way of monitoring the kafka consumption (from outside of the program; so a kafka cli or similar, the offset coming with each record does not suit the use case) of a spark streaing (structured) if it does not commit the offsets?

干杯

推荐答案

kafka 的结构化流将偏移量保存到结构下方的 HDFS.

Structured Streaming for kafka saves offsets to HDFS below structures.

示例 checkpointLocation 设置如下.

Example checkpointLocation setting is below.

.writeStream.
.....
  option("checkpointLocation", "/tmp/checkPoint")
.....

在这种情况下,kafka 的结构化流保存在路径下方

In that case, Structured Streaming for kafka saves below path

/tmp/checkPoint/offsets/$'batchid'

保存的文件包含以下格式.

Saved file contains below format.

v1
{"batchWatermarkMs":0,"batchTimestampMs":$'timestamp',"conf":{"spark.sql.shuffle.partitions":"200"}}
{"Topic1WithPartiton1":{"0":$'OffsetforTopic1ForPartition0'},"Topic2WithPartiton2":{"1":$'OffsetforTopic2ForPartition1',"0":$'OffsetforTopic2ForPartition1'}}

例如.

v1
{"batchWatermarkMs":0,"batchTimestampMs":1505718000115,"conf":{"spark.sql.shuffle.partitions":"200"}}
{"Topic1WithPartiton1":{"0":21482917},"Topic2WithPartiton2":{"1":103557997,"0":103547910}}

所以,我认为为了监控偏移滞后,需要开发具有以下功能的自定义工具.

So, I think for monitoring offset lag, it needs to develop custom tools what has below functions.

  • 从 HDFS 的偏移量读取.
  • 将偏移量写入 Kafka __offset 主题.

这样,现有的偏移滞后监控工具可以监控结构化流媒体的 kafka 偏移滞后.

That way, already existing offset lag monitoring tool can monitor Structured Streaming for kafka's offset lag.

这篇关于结构化流式 Kafka 源偏移存储的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆