Pyspark - Cumulative sum with reset condition
Problem description
I have this DataFrame:
+---+----+---+
| A| B| C|
+---+----+---+
| 0|null| 1|
| 1| 3.0| 0|
| 2| 7.0| 0|
| 3|null| 1|
| 4| 4.0| 0|
| 5| 3.0| 0|
| 6|null| 1|
| 7|null| 1|
| 8|null| 1|
| 9| 5.0| 0|
| 10| 2.0| 0|
| 11|null| 1|
+---+----+---+
What I need is a cumulative sum of the values in column C that resets each time a 0 appears, continuing until all rows are processed.
Expected output:
+---+----+---+----+
| A| B| C| D|
+---+----+---+----+
| 0|null| 1| 1|
| 1| 3.0| 0| 0|
| 2| 7.0| 0| 0|
| 3|null| 1| 1|
| 4| 4.0| 0| 0|
| 5| 3.0| 0| 0|
| 6|null| 1| 1|
| 7|null| 1| 2|
| 8|null| 1| 3|
| 9| 5.0| 0| 0|
| 10| 2.0| 0| 0|
| 11|null| 1| 1|
+---+----+---+----+
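The target logic can be sketched in plain Python (this helper, `reset_cumsum`, is not part of the original question; it is just an illustration of the expected D column): walk down column C, accumulating values and resetting the running total whenever a 0 is encountered.

```python
def reset_cumsum(c):
    """Cumulative sum over c that resets to 0 whenever a 0 appears."""
    out, total = [], 0
    for v in c:
        total = 0 if v == 0 else total + v
        out.append(total)
    return out

# Column C from the example DataFrame
c = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
print(reset_cumsum(c))  # [1, 0, 0, 1, 0, 0, 1, 2, 3, 0, 0, 1]
```

The output matches column D in the table above.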
I have already used the Window().rangeBetween function and reached the desired output, but the problem is that you cannot define a fixed window range: the DataFrame can have the number 1 five times in a row, sometimes only twice, and so on.
My question is very similar to Pyspark : Cumulative Sum with reset condition, but nobody has answered it.
To reproduce the DataFrame:
from pyspark.shell import sc
from pyspark.sql.functions import when

x = sc.parallelize([
    [0, None], [1, 3.], [2, 7.], [3, None], [4, 4.],
    [5, 3.], [6, None], [7, None], [8, None], [9, 5.], [10, 2.], [11, None]
])
x = x.toDF(['A', 'B'])

# Transform null values into "1"
x = x.withColumn('C', when(x.B.isNull(), 1).otherwise(0))
Recommended answer
Create a temporary column (grp) that increments a counter each time column C equals 0 (the reset condition), and use it as the partitioning column for your cumulative sum.
import pyspark.sql.functions as f
from pyspark.sql import Window

x.withColumn(
    "grp",
    f.sum((f.col("C") == 0).cast("int")).over(Window.orderBy("A"))
).withColumn(
    "D",
    f.sum(f.col("C")).over(Window.partitionBy("grp").orderBy("A"))
).drop("grp").show()
#+---+----+---+---+
#| A| B| C| D|
#+---+----+---+---+
#| 0|null| 1| 1|
#| 1| 3.0| 0| 0|
#| 2| 7.0| 0| 0|
#| 3|null| 1| 1|
#| 4| 4.0| 0| 0|
#| 5| 3.0| 0| 0|
#| 6|null| 1| 1|
#| 7|null| 1| 2|
#| 8|null| 1| 3|
#| 9| 5.0| 0| 0|
#| 10| 2.0| 0| 0|
#| 11|null| 1| 1|
#+---+----+---+---+
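To see why the temporary grp column works, here is a plain-Python sketch (not from the original answer) of what `f.sum((f.col("C") == 0).cast("int")).over(Window.orderBy("A"))` computes: a running count of zeros in C. Every run of consecutive 1s gets a distinct group id, so the partitioned cumulative sum restarts at each reset.

```python
from itertools import accumulate

# Column C from the example DataFrame
c = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1]

# Running count of zeros seen so far, mirroring the grp column
grp = list(accumulate(1 if v == 0 else 0 for v in c))
print(grp)  # [0, 1, 2, 2, 3, 4, 4, 4, 4, 5, 6, 6]
```

Note that the row starting each run of 1s shares a group id with the zeros just before it; since those zeros contribute nothing to the sum of C, the cumulative sum within each partition still produces the desired D column.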