PySpark: withColumn() with two conditions and three outcomes


Question



I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode:

df = df.withColumn('new_column', 
    IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.)

I am trying to do this in PySpark but I'm not sure about the syntax. Any pointers? I looked into expr() but couldn't get it to work.

Note that df is a pyspark.sql.dataframe.DataFrame.

Solution

There are a few efficient ways to implement this. Let's start with the required imports:

from pyspark.sql.functions import col, expr, when

You can use the Hive IF function inside expr:

new_column_1 = expr(
    """IF(fruit1 IS NULL OR fruit2 IS NULL, 3, IF(fruit1 = fruit2, 1, 0))"""
)
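
If you prefer standard SQL syntax over the Hive IF shorthand, the same logic can be expressed as a CASE expression inside expr (a minimal sketch; the name new_column_1b is just for illustration):

new_column_1b = expr("""
    CASE
        WHEN fruit1 IS NULL OR fruit2 IS NULL THEN 3
        WHEN fruit1 = fruit2 THEN 1
        ELSE 0
    END
""")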

or when + otherwise:

new_column_2 = when(
    col("fruit1").isNull() | col("fruit2").isNull(), 3
).when(col("fruit1") == col("fruit2"), 1).otherwise(0)

Finally, you could use the following trick: an equality comparison between two columns evaluates to NULL when either side is NULL, so coalesce can replace that NULL with 3:

from pyspark.sql.functions import coalesce, lit

new_column_3 = coalesce((col("fruit1") == col("fruit2")).cast("int"), lit(3))

With example data:

df = sc.parallelize([
    ("orange", "apple"), ("kiwi", None), (None, "banana"), 
    ("mango", "mango"), (None, None)
]).toDF(["fruit1", "fruit2"])
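
If you are using a SparkSession entry point rather than a SparkContext, an equivalent DataFrame can be built directly (a sketch, assuming a SparkSession object named spark):

df = spark.createDataFrame(
    [("orange", "apple"), ("kiwi", None), (None, "banana"),
     ("mango", "mango"), (None, None)],
    ["fruit1", "fruit2"]
)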

you can use these column expressions as follows:

(df
    .withColumn("new_column_1", new_column_1)
    .withColumn("new_column_2", new_column_2)
    .withColumn("new_column_3", new_column_3))

and the result is:

+------+------+------------+------------+------------+
|fruit1|fruit2|new_column_1|new_column_2|new_column_3|
+------+------+------------+------------+------------+
|orange| apple|           0|           0|           0|
|  kiwi|  null|           3|           3|           3|
|  null|banana|           3|           3|           3|
| mango| mango|           1|           1|           1|
|  null|  null|           3|           3|           3|
+------+------+------------+------------+------------+
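
If you want to confirm programmatically that the three approaches agree on every row, a quick check could look like this (a sketch, reusing the column expressions and example data from above):

result = (df
    .withColumn("new_column_1", new_column_1)
    .withColumn("new_column_2", new_column_2)
    .withColumn("new_column_3", new_column_3))

# Count rows where the approaches disagree; this should be 0
mismatches = result.filter(
    (col("new_column_1") != col("new_column_2")) |
    (col("new_column_1") != col("new_column_3"))
).count()
assert mismatches == 0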

