How to preserve in pyspark a value across consecutive rows based on condition

Question
I'm on Spark 3.0.1 and this is an exemplification of my pyspark DataFrame:
| label| amount| bool |
-----------------------------
| a | 10 | false |
| a | 2 | false |
| b | 20 | true |
| c | 3 | true |
| d | 2 | false |
| f | 5 | false |
| w | 50 | true |
...
...
This is the code I used to produce the example above:
import pandas as pd

# `spark` is an existing SparkSession
df = spark.createDataFrame(pd.DataFrame({
    'label': ["a", "a", "b", "c", "d", "f", "w"],
    'amount': [10, 2, 20, 3, 2, 5, 50],
    'bool': [False, False, True, True, False, False, True]
}))
I'd like to perform a task which seems really simple to me, but which I'm not managing to achieve.
I'd specifically like to:

- order by `label` (already assumed in the example)
- assign to a new `real_label` column values as such:
  - the value in `label`, if `bool` is `false`
  - the latest (following the `label` ordering) `label` already met with a `false` in `bool`, if `bool` is `true`
An update of the previous example should help for a better understanding of the expected result:
| label| amount| bool | real_label |
------------------------------------
| a    | 10    | false| a          | <- because `bool` is false, `real_label` = `label`
| a    | 2     | false| a          | <- because `bool` is false, `real_label` = `label`
| b    | 20    | true | a          | <- because `a` is the latest `label` with a `false` in `bool`
| c    | 3     | true | a          | <- because `a` is the latest `label` with a `false` in `bool`
| d    | 2     | false| d          | <- because `bool` is false, `real_label` = `label`
| f    | 5     | false| f          | <- because `bool` is false, `real_label` = `label`
| w    | 50    | true | f          | <- because `f` is the latest `label` with a `false` in `bool`
...
...
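The forward-fill rule above can be sketched in plain Python (a minimal emulation of the expected result, not Spark code): remember the latest label seen on a `false` row, and reuse it on `true` rows.

```python
def real_labels(rows):
    """rows: list of (label, bool) tuples, already ordered by label.
    Returns, for each row, the label itself when bool is False,
    otherwise the most recent label seen with bool == False."""
    out, last_false = [], None
    for label, flag in rows:
        if not flag:
            last_false = label  # remember the latest "false" label
        out.append(label if not flag else last_false)
    return out

rows = [("a", False), ("a", False), ("b", True), ("c", True),
        ("d", False), ("f", False), ("w", True)]
print(real_labels(rows))  # ['a', 'a', 'a', 'a', 'd', 'f', 'f']
```

Note the edge case: a `true` row with no preceding `false` row has no label to inherit (here it yields `None`, which matches what a window-based `last(..., ignorenulls=True)` would produce as `null`).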
Is it possible to achieve what I want without knowing the number of consecutive `false` values I can meet, and also given that the real dataframe is very big and performance matters (so toPandas-based answers are unfortunately prohibitive, and it would also be better to avoid `udf` functions)?
Answer

Use the window function `last` to get the previous "false label" if `bool` is true; otherwise keep the label.
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'real_label',
    F.when(
        F.col('bool'),  # get previous false label if true
        F.last(
            F.when(~F.col('bool'), F.col('label')),  # keep false labels, mask true labels with null
            ignorenulls=True
        ).over(Window.orderBy('label'))
    ).otherwise(F.col('label'))  # otherwise keep label if false
)
df2.show()
+-----+------+-----+----------+
|label|amount| bool|real_label|
+-----+------+-----+----------+
| a| 10|false| a|
| a| 2|false| a|
| b| 20| true| a|
| c| 3| true| a|
| d| 2|false| d|
| f| 5|false| f|
| w| 50| true| f|
+-----+------+-----+----------+
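Two side notes on this answer (my observations, not from the original): first, on a `false` row the label itself is the last non-null value in the masked column, so the outer `F.when(...).otherwise(...)` wrapper is arguably redundant and a bare `F.last(F.when(~F.col('bool'), F.col('label')), ignorenulls=True)` over the same window should give the same column. Second, `Window.orderBy` without `partitionBy` forces all rows into a single partition, which is worth keeping in mind on a very big DataFrame. The simplification can be checked with a plain-Python emulation of `last(..., ignorenulls=True)` over the default window frame (unbounded preceding to current row):

```python
def last_ignorenulls(values):
    """Emulate F.last(col, ignorenulls=True) over a window whose frame runs
    from unboundedPreceding to currentRow: for each position, return the
    most recent non-None value up to and including that position."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

labels = ["a", "a", "b", "c", "d", "f", "w"]
bools  = [False, False, True, True, False, False, True]
# mask the labels of "true" rows with None, as F.when(~F.col('bool'), F.col('label')) does
masked = [l if not b else None for l, b in zip(labels, bools)]
print(last_ignorenulls(masked))  # ['a', 'a', 'a', 'a', 'd', 'f', 'f']
```

The result matches the `real_label` column above for both `false` and `true` rows, which is why the `when/otherwise` guard can be dropped: each `false` row is its own most recent non-null value.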