PySpark - Fillna specific rows based on condition
Question
I want to replace null values in a dataframe, but only on rows that match a specific condition.
I have this dataframe:
A|B   |C   |D   |
1|null|null|null|
2|null|null|null|
2|null|null|null|
2|null|null|null|
5|null|null|null|
I want this:
A|B   |C   |D   |
1|null|null|null|
2|x   |x   |x   |
2|x   |x   |x   |
2|x   |x   |x   |
5|null|null|null|
My case
So all the rows that have the number 2 in column A should get replaced.
The columns A, B, C, D are dynamic; they will change in number and name on each run.
I also want to keep all the rows in the result, not only the replaced ones.
What I tried
I tried with df.where and fillna, but it does not keep all the rows.
I also thought about doing it with withColumn, but I only know column A; all the other columns will change on each execution.
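Since only column A is known in advance, the list of columns to fill can be derived from df.columns at runtime. A minimal sketch in plain Python (the column names here are made-up stand-ins for whatever df.columns returns):

```python
# Stand-in for df.columns at runtime; the real names are unknown in advance.
all_columns = ["A", "B", "C", "D"]

# Everything except the known key column "A" should be filled.
cols_to_replace = [c for c in all_columns if c != "A"]
print(cols_to_replace)  # ['B', 'C', 'D']
```

Filtering by name is slightly more robust than slicing by position, since it does not assume "A" is the first column.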
Adapted solution:
df.select(
    "A",
    *[
        when(col("A") == '2',
             coalesce(col(c), lit('0').cast(df.schema[c].dataType))
        ).otherwise(col(c)).alias(c)
        for c in cols_to_replace
    ]
)
Answer
Use pyspark.sql.functions.coalesce:
from pyspark.sql.functions import coalesce, col, lit, when

cols_to_replace = df.columns[1:]
df.select(
    "A",
    *[
        when(col("A") == 2, coalesce(col(c), lit("x"))).otherwise(col(c)).alias(c)
        for c in cols_to_replace
    ]
).show()
#+---+----+----+----+
#| A| B| C| D|
#+---+----+----+----+
#| 1|null|null|null|
#| 2| x| x| x|
#| 2| x| x| x|
#| 2| x| x| x|
#| 5|null|null|null|
#+---+----+----+----+
Inside the list comprehension, you check whether the value of A is 2. If yes, you coalesce the column's value with the literal x, which replaces nulls with x. Otherwise, you keep the same column value.
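Outside Spark, the per-row semantics of when/coalesce can be sketched in plain Python. The row data below is made up to mirror the example; fill_row is a hypothetical helper, not a PySpark API:

```python
rows = [
    {"A": 1, "B": None, "C": None, "D": None},
    {"A": 2, "B": None, "C": None, "D": None},
    {"A": 5, "B": None, "C": None, "D": None},
]
cols_to_replace = ["B", "C", "D"]

def fill_row(row):
    # Mirrors: when(col("A") == 2, coalesce(col(c), lit("x"))).otherwise(col(c))
    if row["A"] == 2:
        # coalesce returns the first non-null argument, so existing
        # values survive and only nulls become "x".
        return {**row, **{c: (row[c] if row[c] is not None else "x")
                          for c in cols_to_replace}}
    return dict(row)  # otherwise: keep the row unchanged

filled = [fill_row(r) for r in rows]
print(filled[1])  # {'A': 2, 'B': 'x', 'C': 'x', 'D': 'x'}
```

Note that coalesce keeps any existing non-null value in the matched rows; only the nulls are filled.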