Weekly Aggregation using Window Function in Spark


Problem Description

I have data from 1st Jan 2017 to 7th Jan 2017, exactly one week, and I want a weekly aggregate. I used the window function in the following manner:

val df_v_3 = df_v_2.groupBy(window(col("DateTime"), "7 day"))
      .agg(sum("Value") as "aggregate_sum")
      .select("window.start", "window.end", "aggregate_sum")

The data in my DataFrame is:

    DateTime,value
    2017-01-01T00:00:00.000+05:30,1.2
    2017-01-01T00:15:00.000+05:30,1.30
--
    2017-01-07T23:30:00.000+05:30,1.43
    2017-01-07T23:45:00.000+05:30,1.4

The output I get is:

2016-12-29T05:30:00.000+05:30,2017-01-05T05:30:00.000+05:30,723.87
2017-01-05T05:30:00.000+05:30,2017-01-12T05:30:00.000+05:30,616.74

It shows that my window starts from 29th Dec 2016, but the actual data starts from 1st Jan 2017. Why is this margin occurring?

Recommended Answer

For tumbling windows like this it is possible to set an offset on the starting time; more information can be found in the blog here. Strictly speaking a sliding window is used, but by setting both the window duration and the sliding duration to the same value it behaves exactly like a tumbling window with a starting offset.

The syntax is as follows:

window(column, window duration, sliding duration, starting offset)

With your values I found that an offset of 64 hours would give a starting time of 2017-01-01 00:00:00.
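
Why 64 hours? The answer does not spell this out, but Spark aligns window boundaries to the Unix epoch, 1970-01-01 00:00:00 UTC; that is also why the question's windows start on 2016-12-29, a Thursday a whole number of weeks after the (Thursday) epoch. The required offset is the remainder of the desired start instant modulo the window duration. A minimal sketch with plain java.time (the helper name is mine, not a Spark API):

import java.time.{Duration, OffsetDateTime, ZoneOffset}

// Remainder of (desired start - Unix epoch) modulo the window length,
// i.e. how far the desired start sits past an epoch-aligned boundary.
def startingOffsetHours(desiredStart: OffsetDateTime, windowHours: Long): Long = {
  val epoch = OffsetDateTime.of(1970, 1, 1, 0, 0, 0, 0, ZoneOffset.UTC)
  val hours = Duration.between(epoch, desiredStart).toHours
  ((hours % windowHours) + windowHours) % windowHours  // non-negative remainder
}

// In a UTC+8 session, 2017-01-01 00:00:00 needs a 64-hour offset:
println(startingOffsetHours(
  OffsetDateTime.of(2017, 1, 1, 0, 0, 0, 0, ZoneOffset.ofHours(8)), 7 * 24))

In a UTC session the remainder is 72 hours instead, so the exact figure depends on the timezone in which the timestamps are interpreted; recompute it for your own session timezone rather than copying 64 blindly.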

import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named `spark` is in scope

val data = Seq(("2017-01-01 00:00:00", 1.0),
               ("2017-01-01 00:15:00", 2.0),
               ("2017-01-08 23:30:00", 1.43))
val df = data.toDF("DateTime", "value")
  .withColumn("DateTime", to_timestamp($"DateTime", "yyyy-MM-dd HH:mm:ss"))

// A 1-week window sliding by 1 week is a tumbling weekly window;
// the 64-hour offset shifts its start to 2017-01-01 00:00:00.
val df2 = df
  .groupBy(window(col("DateTime"), "1 week", "1 week", "64 hours"))
  .agg(sum("value") as "aggregate_sum")
  .select("window.start", "window.end", "aggregate_sum")

This gives the following result DataFrame:

+-------------------+-------------------+-------------+
|              start|                end|aggregate_sum|
+-------------------+-------------------+-------------+
|2017-01-01 00:00:00|2017-01-08 00:00:00|          3.0|
|2017-01-08 00:00:00|2017-01-15 00:00:00|         1.43|
+-------------------+-------------------+-------------+
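
A sketch of an alternative, not part of the original answer: if plain calendar-week totals are enough, date_trunc (available since Spark 2.3) truncates each timestamp to the Monday of its week in the session timezone, avoiding the offset arithmetic entirely. Reusing df from the example above:

import org.apache.spark.sql.functions.{col, date_trunc, sum}

// Group by the Monday that starts each calendar week.
val weekly = df
  .groupBy(date_trunc("week", col("DateTime")).as("week_start"))
  .agg(sum("value").as("aggregate_sum"))
  .orderBy("week_start")

The trade-off is that week boundaries are fixed to Mondays, whereas window with a starting offset lets the week begin at any chosen instant.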

