如何根据条件在 pyspark 中跨连续行保留值 [英] How to preserve in pyspark a value across consecutive rows based on condition

查看:16
本文介绍了如何根据条件在 pyspark 中跨连续行保留值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我使用的是 Spark 3.0.1,这是我的 pyspark DataFrame 的一个示例:

I'm on Spark 3.0.1 and this is an exemplification of my pyspark DataFrame:

| label| amount| bool   |
 ----------------------------- 
| a    | 10    | false  |
| a    | 2     | false  |
| b    | 20    | true   |
| c    | 3     | true   |
| d    | 2     | false  |
| f    | 5     | false  |
| w    | 50    | true   |
...
...

这是我用来生成上述示例的代码:

This is the code I used to produce aforementioned exemplification:

df = spark.createDataFrame(pd.DataFrame({
     'label': ["a", "a", "b", "c", "d", "f", "w"],
     'amount': [10, 2, 20, 3, 2, 5, 50],
     'bool': [False, False, True, True, False, False, True]
     }))

我想执行一项在我看来非常简单但我无法完成的任务.

I'd like to perform a task which seems to me really simple but that I'm not managing to achieve.

我特别想:

  1. label 排序(示例中已经假设)
  2. 分配给新的 true_label 列,值如下:
    • label 中的值,如果 boolfalse
    • 最新的(按照label顺序)label已经在bool中遇到了false
  1. order by label (already assumpted in the example)
  2. assign to a new true_label column, values as such:
    • the value in label, if bool is false
    • the latest (following the label ordering) label already met with a false in bool

对上一个示例的更新应该有助于更好地理解预期结果:

An update of the previous example should help for a better understanding of the expected result:

| label| amount| bool   | real_label |
 ----------------------------------- 
| a    | 10    | false  | a          |  <- because `bool` is false, `real_label` = `label`
| a    | 2     | false  | a          |  <- because `bool` is false, `real_label` = `label`
| b    | 20    | true   | a          |  <- because `a` the latest `label` with a `false` in `bool` 
| c    | 3     | true   | a          |  <- because `a` the latest `label` with a `false` in `bool` 
| d    | 2     | false  | d          |  <- because `bool` is false, `real_label` = `label`
| f    | 5     | false  | f          |  <- because `bool` is false, `real_label` = `label`
| w    | 50    | true   | f          |  <- because `f` the latest `label` with a `false` in `bool` 
...
...

是否有可能在不知道我能遇到的连续 false 的数量的情况下实现我想要的,并且还考虑到真实的数据框非常大并且性能很重要(因此,不幸的是,基于 toPandas 的答案是proibitive并且最好避免使用 udf 函数)?

Is It possible to achieve what I want without knowing the number of consecutive false I can meet and also given that the real dataframe is very big and performance matter (so toPandas-based answers are unfortunately proibitive and would also be better to avoid udf functions)?

推荐答案

使用窗口函数last获取上一个假标签"如果为真,否则保留标签.

Use window function last to get the previous "false label" if true, otherwise keep the label.

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'real_label',
    F.when(
        F.col('bool'),    # get previous false label if true
        F.last(
            F.when(~F.col('bool'), F.col('label')),    # keep false labels and mask true labels with null
            ignorenulls=True
        ).over(Window.orderBy('label'))
    ).otherwise(F.col('label'))    # otherwise keep label if false
)

df2.show()
+-----+------+-----+----------+
|label|amount| bool|real_label|
+-----+------+-----+----------+
|    a|    10|false|         a|
|    a|     2|false|         a|
|    b|    20| true|         a|
|    c|     3| true|         a|
|    d|     2|false|         d|
|    f|     5|false|         f|
|    w|    50| true|         f|
+-----+------+-----+----------+

这篇关于如何根据条件在 pyspark 中跨连续行保留值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆