PySpark: withColumn() with two conditions and three outcomes
Question
I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode:
    df = df.withColumn('new_column',
        IF fruit1 == fruit2 THEN 1 ELSE 0; IF fruit1 IS NULL OR fruit2 IS NULL THEN 3)
I am trying to do this in PySpark but I'm not sure about the syntax. Any pointers? I looked into expr() but couldn't get it to work. Note that df is a pyspark.sql.dataframe.DataFrame.
Solution

There are a few efficient ways to implement this. Let's start with the required imports:
from pyspark.sql.functions import col, expr, when
You can use the Hive IF function inside expr:

    new_column_1 = expr(
        """IF(fruit1 IS NULL OR fruit2 IS NULL, 3, IF(fruit1 = fruit2, 1, 0))"""
    )
or when + otherwise:

    new_column_2 = when(
        col("fruit1").isNull() | col("fruit2").isNull(), 3
    ).when(col("fruit1") == col("fruit2"), 1).otherwise(0)
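For intuition, the three-way logic of that when/otherwise chain can be sketched in plain Python (classify is a hypothetical helper for illustration, not part of PySpark), applied to the example rows used later in this answer:

```python
def classify(fruit1, fruit2):
    """Mirror the when/otherwise chain: 3 if either value is
    missing, 1 if the values match, 0 otherwise."""
    if fruit1 is None or fruit2 is None:
        return 3
    return 1 if fruit1 == fruit2 else 0

rows = [("orange", "apple"), ("kiwi", None),
        (None, "banana"), ("mango", "mango"), (None, None)]
print([classify(a, b) for a, b in rows])  # [0, 3, 3, 1, 3]
```

Note that the null check must come first, exactly as in the when chain: once a branch matches, later branches are not evaluated.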
Finally, you could use the following trick:

    from pyspark.sql.functions import coalesce, lit

    new_column_3 = coalesce((col("fruit1") == col("fruit2")).cast("int"), lit(3))

This works because an equality comparison involving a NULL evaluates to NULL, so coalesce falls through to the literal 3.
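The coalesce trick relies on SQL's three-valued logic: comparing anything to NULL yields NULL rather than False. A rough Python model of that behaviour (sql_eq and first_non_null are illustrative names, not PySpark functions):

```python
def sql_eq(a, b):
    """SQL-style equality: NULL (None) if either operand is NULL,
    else 0/1 like the cast("int") in the trick above."""
    if a is None or b is None:
        return None
    return int(a == b)

def first_non_null(*args):
    """Model of COALESCE: return the first argument that is not NULL."""
    return next((x for x in args if x is not None), None)

rows = [("orange", "apple"), ("kiwi", None),
        (None, "banana"), ("mango", "mango"), (None, None)]
print([first_non_null(sql_eq(a, b), 3) for a, b in rows])  # [0, 3, 3, 1, 3]
```

The comparison alone would leave NULLs in the column; wrapping it in coalesce with lit(3) is what turns those NULLs into the third outcome.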
With example data:

    df = sc.parallelize([
        ("orange", "apple"), ("kiwi", None),
        (None, "banana"), ("mango", "mango"), (None, None)
    ]).toDF(["fruit1", "fruit2"])
you can use it as follows:

    (df
        .withColumn("new_column_1", new_column_1)
        .withColumn("new_column_2", new_column_2)
        .withColumn("new_column_3", new_column_3))
and the result is:

    +------+------+------------+------------+------------+
    |fruit1|fruit2|new_column_1|new_column_2|new_column_3|
    +------+------+------------+------------+------------+
    |orange| apple|           0|           0|           0|
    |  kiwi|  null|           3|           3|           3|
    |  null|banana|           3|           3|           3|
    | mango| mango|           1|           1|           1|
    |  null|  null|           3|           3|           3|
    +------+------+------------+------------+------------+