PySpark: withColumn() with two conditions and three outcomes
Question
I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode:
    df = df.withColumn('new_column',
        IF fruit1 == fruit2 THEN 1 ELSE 0; IF fruit1 IS NULL OR fruit2 IS NULL THEN 3)
I am trying to do this in PySpark but I'm not sure about the syntax. Any pointers? I looked into expr() but couldn't get it to work. Note that df is a pyspark.sql.dataframe.DataFrame.
Solution

There are a few efficient ways to implement this. Let's start with the required imports:
from pyspark.sql.functions import col, expr, when
You can use the Hive IF function inside expr:

    new_column_1 = expr(
        """IF(fruit1 IS NULL OR fruit2 IS NULL, 3, IF(fruit1 = fruit2, 1, 0))"""
    )
or when + otherwise:

    new_column_2 = when(
        col("fruit1").isNull() | col("fruit2").isNull(), 3
    ).when(col("fruit1") == col("fruit2"), 1).otherwise(0)
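For intuition, the three-way logic of that when/otherwise chain can be sketched in plain Python (classify is a hypothetical helper for illustration, not part of PySpark), applied to the example rows used later in this answer:

```python
def classify(fruit1, fruit2):
    """Mirror the when/otherwise chain: 3 if either value is
    missing, 1 if the values match, 0 otherwise."""
    if fruit1 is None or fruit2 is None:
        return 3
    return 1 if fruit1 == fruit2 else 0

rows = [("orange", "apple"), ("kiwi", None),
        (None, "banana"), ("mango", "mango"), (None, None)]
print([classify(a, b) for a, b in rows])  # [0, 3, 3, 1, 3]
```

Note that the null check must come first, exactly as in the when chain: once a branch matches, later branches are not evaluated.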
Finally, you could use the following trick:

    from pyspark.sql.functions import coalesce, lit

    new_column_3 = coalesce((col("fruit1") == col("fruit2")).cast("int"), lit(3))

This works because an equality comparison involving a NULL evaluates to NULL, so coalesce falls through to the literal 3.
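The coalesce trick relies on SQL's three-valued logic: comparing anything to NULL yields NULL rather than False. A rough Python model of that behaviour (sql_eq and first_non_null are illustrative names, not PySpark functions):

```python
def sql_eq(a, b):
    """SQL-style equality: NULL (None) if either operand is NULL,
    else 0/1 like the cast("int") in the trick above."""
    if a is None or b is None:
        return None
    return int(a == b)

def first_non_null(*args):
    """Model of COALESCE: return the first argument that is not NULL."""
    return next((x for x in args if x is not None), None)

rows = [("orange", "apple"), ("kiwi", None),
        (None, "banana"), ("mango", "mango"), (None, None)]
print([first_non_null(sql_eq(a, b), 3) for a, b in rows])  # [0, 3, 3, 1, 3]
```

The comparison alone would leave NULLs in the column; wrapping it in coalesce with lit(3) is what turns those NULLs into the third outcome.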
With example data:

    df = sc.parallelize([
        ("orange", "apple"), ("kiwi", None),
        (None, "banana"), ("mango", "mango"), (None, None)
    ]).toDF(["fruit1", "fruit2"])
you can use it as follows:

    (df
        .withColumn("new_column_1", new_column_1)
        .withColumn("new_column_2", new_column_2)
        .withColumn("new_column_3", new_column_3))
and the result is:

    +------+------+------------+------------+------------+
    |fruit1|fruit2|new_column_1|new_column_2|new_column_3|
    +------+------+------------+------------+------------+
    |orange| apple|           0|           0|           0|
    |  kiwi|  null|           3|           3|           3|
    |  null|banana|           3|           3|           3|
    | mango| mango|           1|           1|           1|
    |  null|  null|           3|           3|           3|
    +------+------+------------+------------+------------+