Python Spark Cumulative Sum by Group Using DataFrame
Question
How do I compute the cumulative sum per group, specifically using the DataFrame abstraction, in PySpark?
Using the following example dataset:
df = sqlContext.createDataFrame(
    [(1, 2, "a"), (3, 2, "a"), (1, 3, "b"), (2, 2, "a"), (2, 3, "b")],
    ["time", "value", "class"],
)
+----+-----+-----+
|time|value|class|
+----+-----+-----+
| 1| 2| a|
| 3| 2| a|
| 1| 3| b|
| 2| 2| a|
| 2| 3| b|
+----+-----+-----+
I would like to add a column with the cumulative sum of value for each class grouping, ordered by the time variable.
Answer
This can be done using a combination of a window function and the Window.unboundedPreceding value in the window's range, as follows:
from pyspark.sql import Window
from pyspark.sql import functions as F
windowval = (Window.partitionBy('class').orderBy('time')
             .rangeBetween(Window.unboundedPreceding, 0))
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()
+----+-----+-----+-------+
|time|value|class|cum_sum|
+----+-----+-----+-------+
| 1| 3| b| 3|
| 2| 3| b| 6|
| 1| 2| a| 2|
| 2| 2| a| 4|
| 3| 2| a| 6|
+----+-----+-----+-------+