null value and countDistinct with spark dataframe
Question
I have a very simple dataframe
df = spark.createDataFrame([(None,1,3),(2,1,3),(2,1,3)], ['a','b','c'])
+----+---+---+
| a| b| c|
+----+---+---+
|null| 1| 3|
| 2| 1| 3|
| 2| 1| 3|
+----+---+---+
When I apply countDistinct on this dataframe, I find different results depending on the method:
First method
df.distinct().count()
2
It's the result I expect: the last 2 rows are identical, but the first one is distinct from the other 2 (because of the null value).
Second Method
import pyspark.sql.functions as F
df.agg(F.countDistinct("a","b","c")).show()
1
It seems that the way F.countDistinct deals with the null value is not intuitive to me. Does it look like a bug or normal behavior to you? And if it is normal, how can I write something that outputs exactly the result of the first approach, but in the same spirit as the second method?
countDistinct works the same way as Hive count(DISTINCT expr[, expr]):
count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.
The first row is not included. This is common for SQL functions.