Spark Structured Streaming - Using different Windows for different GroupBy Keys


Question

Currently I have the following table after reading from a Kafka topic via Spark Structured Streaming:

key,timestamp,value  
-----------------------------------
key1,2017-11-14 07:50:00+0000,10    
key1,2017-11-14 07:50:10+0000,10  
key1,2017-11-14 07:51:00+0000,10    
key1,2017-11-14 07:51:10+0000,10    
key1,2017-11-14 07:52:00+0000,10    
key1,2017-11-14 07:52:10+0000,10  

key2,2017-11-14 07:50:00+0000,10  
key2,2017-11-14 07:51:00+0000,10  
key2,2017-11-14 07:52:10+0000,10  
key2,2017-11-14 07:53:00+0000,10  

I would like to use a different window for each key and perform the aggregation.

For example, key1 would be aggregated on a 1-minute window to yield:

key,window,sum
------------------------------------------
key1,[2017-11-14 07:50:00+0000,2017-11-14 07:51:00+0000],20  
key1,[2017-11-14 07:51:00+0000,2017-11-14 07:52:00+0000],20  
key1,[2017-11-14 07:52:00+0000,2017-11-14 07:53:00+0000],20  

key2 would be aggregated on a 2-minute window to yield:

key,window,sum
------------------------------------------
key2,[2017-11-14 07:50:00+0000,2017-11-14 07:52:00+0000],20  
key2,[2017-11-14 07:52:00+0000,2017-11-14 07:54:00+0000],20  

Currently I do the following:

val l1 = List(("key1", "60 seconds"), ("key2", "120 seconds"))
l1.foreach { case (key, windowDuration) =>

    // keep only the rows for this key
    val filtered_df = df.filter($"key" === key)

    val windowedPlantSum = filtered_df
        .withWatermark("timestamp", "120 minutes")
        .groupBy(
          window($"timestamp", windowDuration),
          $"key"
        )
        .agg(sum("value").alias("sum"))

    // start the stream (the console sink here is only illustrative)
    windowedPlantSum.writeStream
        .outputMode("update")
        .format("console")
        .start()
}

The above approach starts 2 separate streaming queries. In my case there are 200 such keys, so 200 queries are started, and the job fails due to memory issues.

Is there any way to specify the window based on the key in Spark Structured Streaming, or is there any other approach?

Answer

I guess you have to use mapGroupsWithState to manage only one query.

From slide 28 onwards: https://www.slideshare.net/databricks/arbitrary-stateful-aggregations-using-structured-streaming-in-apache-spark

See also:

  • Arbitrary Stateful Processing in Apache Spark’s Structured Streaming
  • Deep dive stateful stream processing
  • Official documentation
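
As a concrete illustration of that suggestion, here is a minimal sketch of a single flatMapGroupsWithState query that buckets each key into its own tumbling-window length. Everything below is an assumption, not code from the answer: the Event and WindowSum case classes, the windowMillis lookup table, and the simplification of re-emitting every open window on each batch (a real job would close windows via watermarks or state timeouts). It assumes a SparkSession named spark with import spark.implicits._ in scope, and df with the key/timestamp/value schema shown above.

import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical per-key window lengths, in milliseconds.
val windowMillis = Map("key1" -> 60000L, "key2" -> 120000L)

case class Event(key: String, timestamp: Timestamp, value: Long)
case class WindowSum(key: String, windowStart: Timestamp, windowEnd: Timestamp, sum: Long)

// State per key: running sums indexed by window-start millis.
def updateAcrossEvents(
    key: String,
    events: Iterator[Event],
    state: GroupState[Map[Long, Long]]): Iterator[WindowSum] = {
  val len = windowMillis.getOrElse(key, 60000L)
  var sums = state.getOption.getOrElse(Map.empty[Long, Long])
  events.foreach { e =>
    // Floor the event time to this key's tumbling-window start.
    val start = (e.timestamp.getTime / len) * len
    sums = sums.updated(start, sums.getOrElse(start, 0L) + e.value)
  }
  state.update(sums)
  // Simplification: emit all open windows each batch rather than
  // waiting for a watermark/timeout to close them.
  sums.iterator.map { case (start, s) =>
    WindowSum(key, new Timestamp(start), new Timestamp(start + len), s)
  }
}

val result = df.as[Event]
  .groupByKey(_.key)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(updateAcrossEvents)

Because one query handles every key, the 200 keys that previously needed 200 separate streams now share a single streaming query and a single state store.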
