Read Kafka topic in a Spark batch job
Question
I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic. For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD; however, I need to set the offsets for all the partitions and also need to store them somewhere (ZK? HDFS?) to know where to start the next batch job.
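As a rough illustration of that API, a batch read with createRDD might look like the sketch below. The topic name, broker address, and offset values are placeholders, and the offset ranges would in practice come from wherever the previous run stored them:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val sc = new SparkContext(new SparkConf().setAppName("kafka-batch"))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

// One OffsetRange per partition: reads offsets [fromOffset, untilOffset).
// These bounds are hypothetical; a real job loads them from its offset store.
val offsetRanges = Array(
  OffsetRange("my-topic", 0, 0L, 100L),
  OffsetRange("my-topic", 1, 0L, 100L)
)

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

// rdd contains (key, value) pairs for exactly the requested offset ranges
rdd.map(_._2).take(5).foreach(println)
```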
What is the right approach to read from Kafka in a batch job?
I'm also thinking about writing a streaming job instead, which reads with auto.offset.reset=smallest and saves its checkpoint to HDFS, so that the next run starts from there.
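Instead of relying on streaming checkpoints, a batch job can persist its own "next start" offsets in a small HDFS file between runs. A minimal sketch, assuming a simple `topic partition offset` line format and a made-up path:

```scala
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical location for the offset file; one line per partition.
val offsetsPath = new Path("/jobs/kafka-batch/offsets.txt")

def saveOffsets(fs: FileSystem, offsets: Map[(String, Int), Long]): Unit = {
  val out = fs.create(offsetsPath, true) // overwrite the previous run's file
  try {
    val text = offsets.map { case ((t, p), o) => s"$t $p $o" }.mkString("\n")
    out.write(text.getBytes(StandardCharsets.UTF_8))
  } finally out.close()
}

def loadOffsets(fs: FileSystem): Map[(String, Int), Long] = {
  if (!fs.exists(offsetsPath)) Map.empty // first run: caller falls back to earliest
  else {
    val src = scala.io.Source.fromInputStream(fs.open(offsetsPath))
    try src.getLines().map { line =>
      val Array(t, p, o) = line.split(" ")
      (t, p.toInt) -> o.toLong
    }.toMap
    finally src.close()
  }
}
```

Each run would load these offsets as the `fromOffset` values, read up to the current latest offsets, and save those as the next starting point.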
But in this case how can I just fetch once and stop streaming after the first batch?
Answer
createRDD is the right approach for reading a batch from Kafka.
To query for info about the latest / earliest available offsets, look at the KafkaCluster.scala methods getLatestLeaderOffsets and getEarliestLeaderOffsets. That file was private, but should be public in the latest versions of Spark.