null value and countDistinct with spark dataframe


Question


I have a very simple dataframe

  df = spark.createDataFrame([(None,1,3),(2,1,3),(2,1,3)], ['a','b','c'])

  +----+---+---+
  |   a|  b|  c|
  +----+---+---+
  |null|  1|  3|
  |   2|  1|  3|
  |   2|  1|  3|
  +----+---+---+

When I apply a countDistinct on this dataframe, I find different results depending on the method:

First method

  df.distinct().count()

2

It's the result I expect: the last two rows are identical, but the first row is distinct from the other two (because of the null value).

Second Method

  import pyspark.sql.functions as F
  df.agg(F.countDistinct("a","b","c")).show()

1

The way F.countDistinct deals with the null value is not intuitive to me.

Does this look like a bug or normal behavior to you? And if it is normal, how can I write something that outputs exactly the result of the first approach but in the same spirit as the second method?

Solution

countDistinct works the same way as Hive count(DISTINCT expr[, expr]):

count(DISTINCT expr[, expr]) - Returns the number of rows for which the supplied expression(s) are unique and non-NULL.

The first row is therefore not counted, because its a column is NULL. This NULL-skipping behavior is standard for SQL aggregate functions.
