Sum of array elements depending on value condition pyspark
Question
I have a pyspark dataframe:
id | column
------------------------------
1 | [0.2, 2, 3, 4, 3, 0.5]
------------------------------
2 | [7, 0.3, 0.3, 8, 2]
------------------------------
I would like to create 3 columns:

- Column 1: contains the sum of the elements < 2
- Column 2: contains the sum of the elements > 2
- Column 3: contains the sum of the elements = 2 (sometimes I have duplicate values, so I sum them). If there are no such values, I put null.
Expected result:
id | column                  | column<2 | column>2 | column=2
--------------------------------------------------------------
1  | [0.2, 2, 3, 4, 3, 0.5]  | [0.7]    | [12]     | null
--------------------------------------------------------------
2  | [7, 0.3, 0.3, 8, 2]     | [0.6]    | [15]     | [2]
--------------------------------------------------------------
Can you help me? Thank you.
Answer
For Spark 2.4+, you can use the aggregate and filter higher-order functions like this:
from pyspark.sql.functions import expr

df.withColumn("column<2", expr("aggregate(filter(column, x -> x < 2), 0D, (acc, x) -> acc + x)")) \
  .withColumn("column>2", expr("aggregate(filter(column, x -> x > 2), 0D, (acc, x) -> acc + x)")) \
  .withColumn("column=2", expr("aggregate(filter(column, x -> x == 2), 0D, (acc, x) -> acc + x)")) \
  .show(truncate=False)
Gives:
+---+------------------------------+--------+--------+--------+
|id |column |column<2|column>2|column=2|
+---+------------------------------+--------+--------+--------+
|1 |[0.2, 2.0, 3.0, 4.0, 3.0, 0.5]|0.7 |10.0 |2.0 |
|2 |[7.0, 0.3, 0.3, 8.0, 2.0] |0.6 |15.0 |2.0 |
+---+------------------------------+--------+--------+--------+
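The per-row logic of the expressions above can be sketched in plain Python (no Spark), which also shows one way to meet the original null requirement: `aggregate` with a `0D` start value yields `0.0`, not null, when the filtered array is empty, so this sketch returns `None` in that case instead. The helper name `conditional_sums` is illustrative, not part of any API:

```python
def conditional_sums(values):
    """Return (sum of elements < 2, sum of elements > 2, sum of elements == 2).

    Each position is None when no element matches, per the question's null rule.
    """
    below = [x for x in values if x < 2]   # mirrors filter(column, x -> x < 2)
    above = [x for x in values if x > 2]   # mirrors filter(column, x -> x > 2)
    equal = [x for x in values if x == 2]  # mirrors filter(column, x -> x == 2)
    # sum() over an empty list would give 0; return None instead.
    return tuple(sum(group) if group else None for group in (below, above, equal))

print(conditional_sums([0.2, 2, 3, 4, 3, 0.5]))  # (0.7, 10, 2)
print(conditional_sums([7, 0.3, 0.3, 8, 2]))
```

Note that this matches the answer's output (10.0 for row 1's column>2, and 2.0 in column=2 since both rows contain a 2), rather than the [12] and null shown in the question's expected-result table.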