The difference between countDistinct and distinct.count


Question

Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count? Is this the same difference as between select count(distinct member_id) and select distinct count(member_id)?

Recommended answer

df.agg(countDistinct("member_id") as "count")

returns the number of distinct values in the member_id column, ignoring all other columns, while

df.distinct.count

counts the number of distinct records in the DataFrame, where two records are treated as the same only if their values match in every column.

So, for example, the DataFrame:

+-----------+---------+
|member_name|member_id|
+-----------+---------+
|          a|        1|
|          b|        1|
|          b|        1|
+-----------+---------+

has only one distinct member_id value but two distinct records, so the agg version returns 1 while df.distinct.count returns 2.
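To check this yourself, here is a minimal sketch that rebuilds the three-row DataFrame and runs both counts, plus the two SQL queries from the question. It assumes Spark SQL is on the classpath and is run as a standalone script; the local session, the app name "distinct-demo" and the view name "members" are placeholders chosen for illustration, not anything from the original post. In spark-shell a session already exists as spark, so the builder line can be skipped there.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

// Local session for experimentation only; master and app name are assumptions.
val spark = SparkSession.builder().master("local[*]").appName("distinct-demo").getOrCreate()
import spark.implicits._

// The three-row DataFrame from the example above.
val df = Seq(("a", 1), ("b", 1), ("b", 1)).toDF("member_name", "member_id")

// Distinct values of one column: only the value 1 appears, so this is 1.
val distinctIds = df.agg(countDistinct("member_id") as "count").first().getLong(0)

// Distinct whole rows: (a, 1) and (b, 1), so this is 2.
val distinctRows = df.distinct.count

// The SQL queries from the question, run against a temporary view.
df.createOrReplaceTempView("members")

// Same meaning as countDistinct: 1.
val sqlCountDistinct =
  spark.sql("SELECT COUNT(DISTINCT member_id) FROM members").first().getLong(0)

// Counts all non-null member_id values (3), then de-duplicates the
// single-row result, which changes nothing: 3.
val sqlDistinctCount =
  spark.sql("SELECT DISTINCT COUNT(member_id) FROM members").first().getLong(0)

println(s"countDistinct=$distinctIds, distinct.count=$distinctRows")
println(s"COUNT(DISTINCT)=$sqlCountDistinct, DISTINCT COUNT=$sqlDistinctCount")

spark.stop()
```

So the analogy in the question holds only halfway: countDistinct corresponds to count(distinct member_id), but df.distinct.count is not select distinct count(member_id). Its closest SQL equivalent is counting the rows of a select distinct * subquery.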
