Python Spark Cumulative Sum by Group Using DataFrame

Views: 464 Published: 2020/9/4 0:10:37 Tags: apache-spark pyspark spark-dataframe

Question

How do I compute the cumulative sum per group in PySpark, specifically using the DataFrame abstraction?

With the following example dataset:

df = sqlContext.createDataFrame( [(1,2,"a"),(3,2,"a"),(1,3,"b"),(2,2,"a"),(2,3,"b")], 
                                 ["time", "value", "class"] )

+----+-----+-----+
|time|value|class|
+----+-----+-----+
|   1|    2|    a|
|   3|    2|    a|
|   1|    3|    b|
|   2|    2|    a|
|   2|    3|    b|
+----+-----+-----+

I would like to add a cumulative sum column of value for each class grouping, ordered by the time variable.

Recommended answer

This can be done by combining a window function with the Window.unboundedPreceding value as the lower bound of the window's range, as follows:

from pyspark.sql import Window
from pyspark.sql import functions as F

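# One window per class, ordered by time; the frame spans from the start
# of the partition up to the current row's time value (offset 0).
windowval = (Window.partitionBy('class').orderBy('time')
             .rangeBetween(Window.unboundedPreceding, 0))
# Summing value over that frame gives the running total per class.
df_w_cumsum = df.withColumn('cum_sum', F.sum('value').over(windowval))
df_w_cumsum.show()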

+----+-----+-----+-------+
|time|value|class|cum_sum|
+----+-----+-----+-------+
|   1|    3|    b|      3|
|   2|    3|    b|      6|
|   1|    2|    a|      2|
|   2|    2|    a|      4|
|   3|    2|    a|      6|
+----+-----+-----+-------+
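
To sanity-check the numbers above without a Spark session, the window's behavior can be modeled in plain Python. The sketch below is only an illustration: the function name cumulative_sum_by_group and its frame parameter are hypothetical and are not part of PySpark. It assumes the usual range-frame semantics, where every row whose time is less than or equal to the current row's time contributes to the total, so rows that tie on time share the same value. A "rows" mode is included for contrast, corresponding to what rowsBetween(Window.unboundedPreceding, 0) would compute.

```python
from collections import defaultdict


def cumulative_sum_by_group(rows, frame="range"):
    """Plain-Python model of a per-group running total (hypothetical helper).

    rows  : list of (time, value, cls) tuples, as in the example dataset
    frame : "range" sums every row whose time is <= the current row's time,
            so ties on time share one total; "rows" sums up to and including
            the current row's position in the sorted partition.
    """
    # Partition the rows by class, like Window.partitionBy('class').
    groups = defaultdict(list)
    for time, value, cls in rows:
        groups[cls].append((time, value))

    result = []
    for cls in sorted(groups):
        # Order each partition by time, like orderBy('time').
        # Python's sort is stable, so ties keep their input order here;
        # Spark makes no such guarantee for tied rows.
        ordered = sorted(groups[cls], key=lambda r: r[0])
        for i, (time, value) in enumerate(ordered):
            if frame == "range":
                total = sum(v for t, v in ordered if t <= time)
            else:
                total = sum(v for _, v in ordered[: i + 1])
            result.append((time, value, cls, total))
    return result


rows = [(1, 2, "a"), (3, 2, "a"), (1, 3, "b"), (2, 2, "a"), (2, 3, "b")]
for row in cumulative_sum_by_group(rows):
    print(row)
```

On the example dataset the two frame modes agree, because time is unique within each class. They differ only when several rows in a class share the same time: the range frame gives all of them the same total, while a rows frame gives each a distinct running total that depends on how the ties happen to be ordered.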
