Difference between === null and isNull in Spark DataFrame


Question

What is the difference when we use

 df.filter(col("c1") === null) and df.filter(col("c1").isNull)

On the same DataFrame I am getting counts with === null but zero counts with isNull. Please help me understand the difference. Thanks.

Recommended answer

First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons.

As for your question, this is plain SQL semantics. col("c1") === null is interpreted as c1 = NULL and, because NULL marks an undefined value, the result of the comparison is undefined (NULL) for any operand, including NULL itself. A filter keeps only rows for which the predicate is true, so a predicate that evaluates to NULL matches no rows.

spark.sql("SELECT NULL = NULL").show

+-------------+
|(NULL = NULL)|
+-------------+
|         null|
+-------------+

spark.sql("SELECT NULL != NULL").show

+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
|               null|
+-------------------+

spark.sql("SELECT TRUE != NULL").show

+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
|                                null|
+------------------------------------+

spark.sql("SELECT TRUE = NULL").show

+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
|                          null|
+------------------------------+
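
The same rule carries over to the DataFrame DSL. What follows is a minimal sketch, not taken from the original answer: it assumes a local SparkSession, and the sample data and the names df, eqNullIsAlwaysNull and eqNullCount are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for the sketch; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("eq-null-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data: c1 is NULL in two of the three rows.
val df = Seq((1, Some("a")), (2, None), (3, None)).toDF("id", "c1")

// c1 = NULL evaluates to NULL for every row, whether c1 is NULL or not.
val eqNullIsAlwaysNull =
  df.select((col("c1") === null).as("eq_null")).collect().forall(_.isNullAt(0))

// filter keeps only rows where the predicate is true, so nothing survives.
val eqNullCount = df.filter(col("c1") === null).count()

println(s"always NULL: $eqNullIsAlwaysNull, rows kept: $eqNullCount")
```

Under these assumptions the comparison column is NULL in every row and the filter keeps no rows.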

The only valid methods to check for NULL are:

  • IS NULL:

spark.sql("SELECT NULL IS NULL").show

+--------------+
|(NULL IS NULL)|
+--------------+
|          true|
+--------------+

spark.sql("SELECT TRUE IS NULL").show

+--------------+
|(true IS NULL)|
+--------------+
|         false|
+--------------+

  • IS NOT NULL:

    spark.sql("SELECT NULL IS NOT NULL").show
    

    +------------------+
    |(NULL IS NOT NULL)|
    +------------------+
    |             false|
    +------------------+
    

    spark.sql("SELECT TRUE IS NOT NULL").show
    

    +------------------+
    |(true IS NOT NULL)|
    +------------------+
    |              true|
    +------------------+
    

    These are implemented in the DataFrame DSL as Column.isNull and Column.isNotNull, respectively.
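
A short sketch of the two DSL methods, again assuming a local SparkSession and a made-up DataFrame (the data and the names df, nullCount and notNullCount are illustrative only):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for the sketch; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("is-null-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data: c1 is NULL in two of the three rows.
val df = Seq((1, Some("a")), (2, None), (3, None)).toDF("id", "c1")

// Column.isNull corresponds to SQL "c1 IS NULL" and yields true/false, never NULL.
val nullCount = df.filter(col("c1").isNull).count()

// Column.isNotNull corresponds to SQL "c1 IS NOT NULL".
val notNullCount = df.filter(col("c1").isNotNull).count()

println(s"NULL rows: $nullCount, non-NULL rows: $notNullCount")
```

Because both predicates always evaluate to true or false, the two counts add up to the total number of rows, which is not the case for === null.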

    Note:

    For NULL-safe comparisons use IS DISTINCT FROM / IS NOT DISTINCT FROM:

    spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
    

    +---------------+
    |(NULL <=> NULL)|
    +---------------+
    |           true|
    +---------------+
    

    spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
    

    +--------------------------------+
    |(CAST(NULL AS BOOLEAN) <=> true)|
    +--------------------------------+
    |                           false|
    +--------------------------------+
    

    or not(_ <=> _) / <=>:

    spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
    

    +---------------+
    |(col1 <=> col2)|
    +---------------+
    |           true|
    +---------------+
    

    spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
    

    +---------------+
    |(col1 <=> col2)|
    +---------------+
    |          false|
    +---------------+
    

    in SQL and the DataFrame DSL, respectively.
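
To see how the NULL-safe operator behaves inside a filter, here is a hedged sketch. It assumes a local SparkSession; the two-column sample data and the names pairs, sameCount, sameViaMethod, distinctCount and plainEqCount are invented for illustration. Column.eqNullSafe is the method form of <=>.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.not

// Local session for the sketch; in spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("null-safe-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data covering equal values, value vs NULL, and NULL vs NULL.
val pairs = Seq[(Option[String], Option[String])](
  (Some("a"), Some("a")),
  (Some("a"), None),
  (None, None)
).toDF("col1", "col2")

// NULL-safe equality: ("a", "a") and (NULL, NULL) both count as equal.
val sameCount = pairs.filter($"col1" <=> $"col2").count()

// eqNullSafe is the named equivalent of <=>.
val sameViaMethod = pairs.filter($"col1".eqNullSafe($"col2")).count()

// NULL-safe inequality: only ("a", NULL) is distinct.
val distinctCount = pairs.filter(not($"col1" <=> $"col2")).count()

// Plain === yields NULL whenever either side is NULL, so only ("a", "a") is kept.
val plainEqCount = pairs.filter($"col1" === $"col2").count()

println(s"<=>: $sameCount, eqNullSafe: $sameViaMethod, not(<=>): $distinctCount, ===: $plainEqCount")
```

The NULL-safe form is the one to reach for when NULL should be treated as an ordinary comparable value, for example in join conditions.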

    Related:

    Including null values in an Apache Spark Join
