Spark structured streaming app reading from multiple Kafka topics


Question

I have a Spark structured streaming app (v2.3.2) which needs to read from a number of Kafka topics, do some relatively simple processing (mainly aggregations and a few joins), and publish the results to a number of other Kafka topics. So multiple streams are processed in the same app.

I was wondering whether it makes a difference from a resource point of view (memory, executors, threads, Kafka listeners, etc.) if I set up just one direct readStream which subscribes to multiple topics and then split the stream with selects, versus one readStream per topic.

Something like

df = spark.readStream.format("kafka").option("subscribe", "t1,t2,t3")
...
t1df = df.select(...).where("topic = 't1'")...
t2df = df.select(...).where("topic = 't2'")...

vs.

t1df = spark.readStream.format("kafka").option("subscribe", "t1")
t2df = spark.readStream.format("kafka").option("subscribe", "t2")
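To make the comparison concrete, a runnable version of the per-topic variant would look roughly like the sketch below. The bootstrap server address is an illustrative placeholder, not from the original post:

```python
# Sketch of the "one readStream per topic" variant.
# "localhost:9092" is a placeholder broker address.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-topic-demo").getOrCreate()

t1df = (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "t1")
        .load())  # the "value" column holds the raw Kafka message bytes

t2df = (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "t2")
        .load())
```

Each DataFrame is then an independent source, so downstream queries on t1df do not touch topic t2 at all.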

Is either one more "efficient" than the other? I could not find any documentation about whether this makes a difference.

Thanks!

Answer

Each action requires a full lineage execution. You're better off separating this into three separate Kafka reads. Otherwise you'll read each topic N times, where N is the number of writes.
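To illustrate the N-reads point: with a single shared subscription, each started query carries its own full lineage back to Kafka, so the topics are consumed once per output query. A sketch (topic names, broker address, and checkpoint paths are illustrative placeholders):

```python
# Two independent queries over the same source DataFrame: each start()
# re-executes the lineage from Kafka, so the subscribed topics are read
# twice here (once per query). All names/paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("shared-read-demo").getOrCreate()

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "t1,t2")
      .load())

q1 = (df.filter(col("topic") == "t1")
      .selectExpr("CAST(value AS STRING) AS value")
      .writeStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "out1")
      .option("checkpointLocation", "/tmp/ckpt1")
      .start())  # read #1 of t1,t2

q2 = (df.filter(col("topic") == "t2")
      .selectExpr("CAST(value AS STRING) AS value")
      .writeStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "out2")
      .option("checkpointLocation", "/tmp/ckpt2")
      .start())  # read #2 of t1,t2
```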

I'd really recommend against this, but if you want to put all the topics into the same read, then do this:

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()  // cache once so each filtered write reuses the batch
  batchDF.filter($"topic" === "t1").write.format(...).save(...)  // location 1
  batchDF.filter($"topic" === "t2").write.format(...).save(...)  // location 2
  batchDF.unpersist()
}
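One caveat: foreachBatch was introduced in Spark 2.4, so it is not available on the 2.3.2 version mentioned in the question. For reference, a PySpark sketch of the same pattern, assuming streaming_df is the multi-topic Kafka stream from the question (sink formats and paths are illustrative placeholders):

```python
from pyspark.sql.functions import col

def write_batch(batch_df, batch_id):
    batch_df.persist()  # cache once so each filtered write reuses the batch
    batch_df.filter(col("topic") == "t1").write.format("parquet").save("/tmp/out1")
    batch_df.filter(col("topic") == "t2").write.format("parquet").save("/tmp/out2")
    batch_df.unpersist()

query = (streaming_df.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/ckpt")
         .start())
```

The persist/unpersist pair is what makes this a single read: without it, each filtered write inside the batch function would recompute the batch.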

