Null values and countDistinct with a Spark DataFrame

Views: 200 | Published: 2020/9/4 8:02:05 | Tags: apache-spark pyspark pyspark-sql


Problem description

I have a very simple DataFrame:

  df = spark.createDataFrame([(None,1,3),(2,1,3),(2,1,3)], ['a','b','c'])

  +----+---+---+
  |   a|  b|  c|
  +----+---+---+
  |null|  1|  3|
  |   2|  1|  3|
  |   2|  1|  3|
  +----+---+---+

When I count distinct rows on this DataFrame, I get different results depending on the method:

  df.distinct().count()

2

This is the result I expect: the last two rows are identical, but the first one is distinct from them because of the null value.

  import pyspark.sql.functions as F
  df.agg(F.countDistinct("a","b","c")).show()

1

The way F.countDistinct handles the null value is not intuitive to me.

Does this look like a bug, or is it normal? If it is normal, how can I write something in the spirit of the second method that returns the same result as the first?
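As far as I know this is expected rather than a bug: F.countDistinct follows the SQL COUNT(DISTINCT a, b, c) rule, which skips any row where at least one of the listed expressions is null, while distinct() treats null as an ordinary value when comparing rows. The sketch below is only a plain-Python model of those two rules applied to the three rows from the question. It is not Spark code, and the function names are made up for illustration.

```python
# Plain-Python model of the two counting rules (no Spark required).
# The function names are hypothetical; they only mimic the behaviour described above.

rows = [(None, 1, 3), (2, 1, 3), (2, 1, 3)]  # same data as the question's DataFrame


def distinct_count(rows):
    """Model of df.distinct().count(): None is compared like any other value."""
    return len(set(rows))


def count_distinct_skip_nulls(rows):
    """Model of COUNT(DISTINCT a, b, c): rows containing any None are ignored."""
    return len({r for r in rows if all(v is not None for v in r)})


print(distinct_count(rows))             # 2
print(count_distinct_skip_nulls(rows))  # 1
```

If that model matches your Spark version, one workaround worth trying is to wrap the columns in a single struct, e.g. `df.agg(F.countDistinct(F.struct("a", "b", "c")))`. The struct itself should not be null even when one of its fields is, so the row is no longer skipped. I have not run this against your setup, so treat it as an assumption to verify.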
