Spark structured streaming app reading from multiple Kafka topics
Question
I have a Spark Structured Streaming app (v2.3.2) that needs to read from a number of Kafka topics, do some relatively simple processing (mainly aggregations and a few joins), and publish the results to a number of other Kafka topics. So multiple streams are processed in the same app.
I was wondering whether it makes a difference from a resource point of view (memory, executors, threads, Kafka listeners, etc.) if I set up just one direct readStream that subscribes to multiple topics and then split the stream with selects, vs. one readStream per topic.
Something like:
df = spark.readStream.format("kafka").option("subscribe", "t1,t2,t3").load()
...
t1df = df.select(...).where("topic = 't1'")...
t2df = df.select(...).where("topic = 't2'")...
vs.
t1df = spark.readStream.format("kafka").option("subscribe", "t1").load()
t2df = spark.readStream.format("kafka").option("subscribe", "t2").load()
Is either one more "efficient" than the other? I could not find any documentation about whether this makes a difference.
Thanks!
Accepted answer
Each action requires a full lineage execution. You're better off separating this into three separate Kafka reads. Otherwise, you'll read each topic N times, where N is the number of writes.
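To make the N-reads point concrete, here is a toy model in plain Python (no Spark needed, and `Source` / `run_queries` are made-up names, not Spark APIs): each output query is an "action" that replays the full lineage back to the source, so a single shared read feeding N writes ends up scanning the topic N times.

```python
class Source:
    """Stands in for a Kafka topic read; counts how often it is scanned."""
    def __init__(self, records):
        self.records = records
        self.read_count = 0

    def read(self):
        self.read_count += 1
        return list(self.records)

def run_queries(source, predicates):
    # One write per predicate; each write re-executes the read (full lineage).
    return [[r for r in source.read() if p(r)] for p in predicates]

src = Source([("t1", "a"), ("t2", "b"), ("t3", "c")])
# Note the t=t default argument: it binds each topic name at definition time,
# avoiding Python's late-binding closure pitfall in the comprehension.
outs = run_queries(src, [lambda r, t=t: r[0] == t for t in ("t1", "t2", "t3")])
print(src.read_count)  # prints 3: the shared source was scanned once per write
```

Three output queries, three scans of the same source, which is exactly the overhead the separate-reads recommendation avoids.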
I'd really recommend against this, but if you want to put all the topics into the same read, then do this:
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()  // cache the micro-batch so each write reuses it
  batchDF.filter($"topic" === "t1").write.format(...).save(...) // location 1
  batchDF.filter($"topic" === "t2").write.format(...).save(...) // location 2
  batchDF.unpersist()
}
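The same toy model (plain Python, illustrative names, not Spark APIs) shows why the foreachBatch + persist() pattern avoids repeated reads: the micro-batch is materialized once, then fanned out to several sinks, so the source is scanned a single time no matter how many writes follow.

```python
class Source:
    """Stands in for a Kafka topic read; counts how often it is scanned."""
    def __init__(self, records):
        self.records = records
        self.read_count = 0

    def read(self):
        self.read_count += 1
        return list(self.records)

def process_batch(source, sinks):
    batch = source.read()  # the persist(): read once, keep the batch in memory
    # Each sink filters the cached batch instead of re-reading the source.
    return {name: [r for r in batch if pred(r)] for name, pred in sinks}

src = Source([("t1", "a"), ("t2", "b"), ("t3", "c")])
out = process_batch(src, [("location1", lambda r: r[0] == "t1"),
                          ("location2", lambda r: r[0] == "t2")])
print(src.read_count)  # prints 1: a single scan, regardless of sink count
```

One scan serving two sinks, versus one scan per sink without the cache, which is the trade-off the answer describes.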