The difference between countDistinct and distinct.count
Question
Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count?
Is the difference the same as between select count(distinct member_id) and select distinct count(member_id)?
Recommended answer
df.agg(countDistinct("member_id") as "count")
returns the number of distinct values of the member_id column, ignoring all other columns, while
df.distinct.count
will count the number of distinct records in the DataFrame - where "distinct" means identical in values of all columns.
So, for example, the DataFrame:
+-----------+---------+
|member_name|member_id|
+-----------+---------+
|          a|        1|
|          b|        1|
|          b|        1|
+-----------+---------+
has only one distinct member_id value but two distinct records, so the agg option would return 1 while distinct.count would return 2.
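The example above can be reproduced with a short Scala sketch. It assumes a local SparkSession; the object name DistinctCounts, the helper counts, and the app name are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.countDistinct

// Hypothetical wrapper object, used only to make the sketch self-contained.
object DistinctCounts {

  // Returns (distinct member_id values, distinct whole rows) for the example data.
  def counts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    // The same three rows as in the table above.
    val df = Seq(("a", 1), ("b", 1), ("b", 1)).toDF("member_name", "member_id")

    // Distinct values of a single column; all other columns are ignored.
    val distinctIds = df.agg(countDistinct("member_id") as "count").first().getLong(0)

    // Distinct records; two rows are equal only if every column matches.
    val distinctRows = df.distinct.count

    (distinctIds, distinctRows)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (an assumption, not part of the original answer).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-counts")
      .getOrCreate()

    val (ids, rows) = counts(spark)
    println(s"countDistinct(member_id) = $ids") // 1
    println(s"distinct.count           = $rows") // 2

    spark.stop()
  }
}
```

The two ("b", 1) rows collapse into one under distinct, which is why the row count is 2 rather than 3, while member_id has the single value 1 in every row.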