Filter Pyspark dataframe column with None value
Question
I'm trying to filter a PySpark dataframe that has None as a row value:
df.select('dt_mvmt').distinct().collect()
[Row(dt_mvmt=u'2016-03-27'),
Row(dt_mvmt=u'2016-03-28'),
Row(dt_mvmt=u'2016-03-29'),
Row(dt_mvmt=None),
Row(dt_mvmt=u'2016-03-30'),
Row(dt_mvmt=u'2016-03-31')]
and I can filter correctly with a string value:
df[df.dt_mvmt == '2016-03-31']
# some results here
but this fails:
df[df.dt_mvmt == None].count()
0
df[df.dt_mvmt != None].count()
0
But there are definitely values in each category. What's going on?
Answer
You can use Column.isNull / Column.isNotNull:
from pyspark.sql.functions import col

df.where(col("dt_mvmt").isNull())
df.where(col("dt_mvmt").isNotNull())
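As a side note (not part of the original answer), where/filter also accept SQL expression strings, so the same predicates can be written without importing col:

df.where("dt_mvmt IS NULL")
df.where("dt_mvmt IS NOT NULL")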
If you want to simply drop NULL values, you can use na.drop with the subset argument:
df.na.drop(subset=["dt_mvmt"])
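A small variant worth knowing: dropna is a documented alias for na.drop, so the same line can also be written as:

df.dropna(subset=["dt_mvmt"])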
Equality-based comparisons with NULL won't work, because in SQL NULL is undefined, so any attempt to compare it with another value returns NULL:
sqlContext.sql("SELECT NULL = NULL").show()
## +-------------+
## |(NULL = NULL)|
## +-------------+
## | null|
## +-------------+
sqlContext.sql("SELECT NULL != NULL").show()
## +-------------------+
## |(NOT (NULL = NULL))|
## +-------------------+
## | null|
## +-------------------+
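If you do need a comparison that treats two NULLs as equal, Spark SQL also has the null-safe equality operator <=> (exposed on the DataFrame API as Column.eqNullSafe in Spark 2.3+). A quick check, with output formatted like the examples above (the exact column header may vary by Spark version):

sqlContext.sql("SELECT NULL <=> NULL").show()
## +---------------+
## |(NULL <=> NULL)|
## +---------------+
## |           true|
## +---------------+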
The only valid way to compare a value with NULL is IS / IS NOT, which are equivalent to the isNull / isNotNull method calls.
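To see that equivalence end to end, you can register the dataframe as a temporary table and apply the same predicate in plain SQL; the table name df_table below is just for illustration (on Spark 2.0+ you would use createOrReplaceTempView instead of the deprecated registerTempTable):

df.registerTempTable("df_table")
sqlContext.sql("SELECT * FROM df_table WHERE dt_mvmt IS NOT NULL").show()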