Spark Structured Streaming - Using different Windows for different GroupBy Keys


Question

Currently I have the following table after reading from a Kafka topic via Spark Structured Streaming:

key,timestamp,value  
-----------------------------------
key1,2017-11-14 07:50:00+0000,10    
key1,2017-11-14 07:50:10+0000,10  
key1,2017-11-14 07:51:00+0000,10    
key1,2017-11-14 07:51:10+0000,10    
key1,2017-11-14 07:52:00+0000,10    
key1,2017-11-14 07:52:10+0000,10  

key2,2017-11-14 07:50:00+0000,10  
key2,2017-11-14 07:51:00+0000,10  
key2,2017-11-14 07:52:10+0000,10  
key2,2017-11-14 07:53:00+0000,10  

I would like to use a different window for each key and perform the aggregation.

For example, key1 would be aggregated on a 1-minute window to yield:

key,window,sum
------------------------------------------
key1,[2017-11-14 07:50:00+0000,2017-11-14 07:51:00+0000],20  
key1,[2017-11-14 07:51:00+0000,2017-11-14 07:52:00+0000],20  
key1,[2017-11-14 07:52:00+0000,2017-11-14 07:53:00+0000],20  

key2 would be aggregated on a 2-minute window to yield:

key,window,sum
------------------------------------------
key2,[2017-11-14 07:50:00+0000,2017-11-14 07:52:00+0000],20  
key2,[2017-11-14 07:52:00+0000,2017-11-14 07:54:00+0000],20  

Currently I do the following:

val l1 = List(("key1", "60 seconds"), ("key2", "120 seconds"))
l1.foreach { case (key, windowDuration) =>

    // keep only the rows for this key
    val filtered_df = df.filter($"key" === key)

    val windowedPlantSum = filtered_df
        .withWatermark("timestamp", "120 minutes")
        .groupBy(
          window($"timestamp", windowDuration),
          $"key"
        )
        .agg(sum("value").alias("sum"))

    // start the stream (the console sink here is only illustrative)
    windowedPlantSum.writeStream
        .outputMode("update")
        .format("console")
        .start()
}

The above approach starts 2 separate streaming queries. In my case there are 200 such keys, so 200 queries are started, and the job fails due to memory issues.

Is there any way to specify the window based on the key in Spark Structured Streaming, or is there any other approach?

Answer

I guess you have to use mapGroupsWithState to manage only one query.

From slide 28 onwards: https://www.slideshare.net/databricks/arbitrary-stateful-aggregations-using-structured-streaming-in-apache-spark

See also:

  • Arbitrary Stateful Processing in Apache Spark’s Structured Streaming
  • Deep dive stateful stream processing
  • Official documentation
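
As a concrete illustration of that suggestion, here is a minimal sketch of a single flatMapGroupsWithState query that buckets each key into its own tumbling-window length. Everything below is an assumption, not code from the answer: the Event and WindowSum case classes, the windowMillis lookup table, and the simplification of re-emitting every open window on each batch (a real job would close windows via watermarks or state timeouts). It assumes a SparkSession named spark with import spark.implicits._ in scope, and df with the key/timestamp/value schema shown above.

import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical per-key window lengths, in milliseconds.
val windowMillis = Map("key1" -> 60000L, "key2" -> 120000L)

case class Event(key: String, timestamp: Timestamp, value: Long)
case class WindowSum(key: String, windowStart: Timestamp, windowEnd: Timestamp, sum: Long)

// State per key: running sums indexed by window-start millis.
def updateAcrossEvents(
    key: String,
    events: Iterator[Event],
    state: GroupState[Map[Long, Long]]): Iterator[WindowSum] = {
  val len = windowMillis.getOrElse(key, 60000L)
  var sums = state.getOption.getOrElse(Map.empty[Long, Long])
  events.foreach { e =>
    // Floor the event time to this key's tumbling-window start.
    val start = (e.timestamp.getTime / len) * len
    sums = sums.updated(start, sums.getOrElse(start, 0L) + e.value)
  }
  state.update(sums)
  // Simplification: emit all open windows each batch rather than
  // waiting for a watermark/timeout to close them.
  sums.iterator.map { case (start, s) =>
    WindowSum(key, new Timestamp(start), new Timestamp(start + len), s)
  }
}

val result = df.as[Event]
  .groupByKey(_.key)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(updateAcrossEvents)

Because one query handles every key, the 200 keys that previously needed 200 separate streams now share a single streaming query and a single state store.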
