在Spark中使用Windows函数进行每周汇总 [英] Weekly Aggregation using Windows Function in Spark
问题描述
我有从2017年1月1日到2017年1月7日开始的数据,这是每周需要的每周汇总.我以下列方式使用了窗口功能
I have data which starts from 1st Jan 2017 to 7th Jan 2017 and it is a week wanted weekly aggregate. I used window function in following manner
val df_v_3 = df_v_2.groupBy(window(col("DateTime"), "7 day"))
.agg(sum("Value") as "aggregate_sum")
.select("window.start", "window.end", "aggregate_sum")
我在数据框中的数据为
DateTime,value
2017-01-01T00:00:00.000+05:30,1.2
2017-01-01T00:15:00.000+05:30,1.30
--
2017-01-07T23:30:00.000+05:30,1.43
2017-01-07T23:45:00.000+05:30,1.4
我得到的输出为:
2016-12-29T05:30:00.000+05:30,2017-01-05T05:30:00.000+05:30,723.87
2017-01-05T05:30:00.000+05:30,2017-01-12T05:30:00.000+05:30,616.74
它显示我的一天是从2016年12月29日开始,但实际数据是从2017年1月1日开始,为什么会出现这种保证金?
It shows that my day is starting from 29th Dec 2016 but in actual data is starting from 1 Jan 2017,why this margin is occuring?
推荐答案
对于像这样的滚动窗口,可以设置开始时间的偏移量,有关更多信息,请参见博客
For tumbling windows like this it is possible to set an offset to the starting time, more information can be found in the blog here. A sliding window is used, however, by setting both "window duration" and "sliding duration" to the same value, it will be the same as a tumbling window with starting offset.
语法如下,
window(column, window duration, sliding duration, starting offset)
使用您的值,我发现偏移量为64小时的起始时间为2017-01-01 00:00:00
.
With your values I found that an offset of 64 hours would give a starting time of 2017-01-01 00:00:00
.
val data = Seq(("2017-01-01 00:00:00",1.0),
("2017-01-01 00:15:00",2.0),
("2017-01-08 23:30:00",1.43))
val df = data.toDF("DateTime","value")
.withColumn("DateTime", to_timestamp($"DateTime", "yyyy-MM-dd HH:mm:ss"))
val df2 = df
.groupBy(window(col("DateTime"), "1 week", "1 week", "64 hours"))
.agg(sum("value") as "aggregate_sum")
.select("window.start", "window.end", "aggregate_sum")
将给出此结果数据框:
+-------------------+-------------------+-------------+
| start| end|aggregate_sum|
+-------------------+-------------------+-------------+
|2017-01-01 00:00:00|2017-01-08 00:00:00| 3.0|
|2017-01-08 00:00:00|2017-01-15 00:00:00| 1.43|
+-------------------+-------------------+-------------+
这篇关于在Spark中使用Windows函数进行每周汇总的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!