null value and countDistinct with spark dataframe


Question

I have a very simple dataframe

  df = spark.createDataFrame([(None,1,3),(2,1,3),(2,1,3)], ['a','b','c'])

  +----+---+---+
  |   a|  b|  c|
  +----+---+---+
  |null|  1|  3|
  |   2|  1|  3|
  |   2|  1|  3|
  +----+---+---+

When I apply a countDistinct on this dataframe, I find different results depending on the method:

  df.distinct().count()

2

This is the result I expect: the last two rows are identical, but the first one is distinct from the other two (because of the null value).

  import pyspark.sql.functions as F
  df.agg(F.countDistinct("a","b","c")).show()

1

The way F.countDistinct handles the null value seems unintuitive to me.

Does this look like a bug or normal behavior to you? And if it is normal, how can I write something that outputs exactly the result of the first approach, but in the same spirit as the second method?

Answer

countDistinct works the same way as Hive count(DISTINCT expr[, expr]):

count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.

The first row is not included. This is common for SQL functions.
