Spark DataFrame: count distinct values of every column
Question
The question is pretty much in the title: is there an efficient way to count the distinct values in every column of a DataFrame?
The describe method provides only the count, not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.
Answer
Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approximating the distinct count:
// Assumes a SparkSession in scope; toDF requires spark.implicits._
import spark.implicits._

val df = Seq((1, 3, 4), (1, 2, 3), (2, 3, 4), (2, 3, 5)).toDF("col1", "col2", "col3")
// Map every column name to the aggregate function to apply to it
val exprs = df.columns.map(_ -> "approx_count_distinct").toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// | 2| 2| 3|
// +---------------------------+---------------------------+---------------------------+
The approx_count_distinct method relies on HyperLogLog under the hood.
The HyperLogLog algorithm and its variant HyperLogLog++ (implemented in Spark) rely on the following clever observation.
If the numbers are spread uniformly across a range, then the count of distinct elements can be approximated from the largest number of leading zeros in the binary representation of the numbers.
For example, if we observe a number whose binary digits are of the form 0…(k times)…01…1, then we can estimate that there are on the order of 2^k elements in the set. This is a very crude estimate, but it can be refined to great precision with a sketching algorithm.
A thorough explanation of the mechanics behind this algorithm can be found in the original paper.
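As an illustration only (this is not Spark's implementation, and all names here are made up for the example), the observation above can be sketched in a few lines of plain Scala: hash each element to a roughly uniform 32-bit value, record the largest number of leading zeros seen, and read off a power of two as a crude cardinality estimate.

```scala
import scala.util.hashing.MurmurHash3

// Toy sketch of the HyperLogLog observation, NOT Spark's implementation:
// hash each element to a (roughly) uniform 32-bit value, record the largest
// number of leading zeros observed, and estimate ~2^k distinct elements.
def crudeDistinctEstimate[T](items: Seq[T]): Long = {
  val maxLeadingZeros = items
    .map(x => Integer.numberOfLeadingZeros(MurmurHash3.stringHash(x.toString)))
    .max
  1L << maxLeadingZeros
}

// The estimate is always a power of two and can be far off for a single
// hash; real HyperLogLog refines it by averaging over many such registers.
val estimate = crudeDistinctEstimate(Seq("a", "b", "c", "a", "b"))
println(estimate)
```

This is exactly why the result is only approximate: a single register gives a coarse power-of-two guess, and the sketching machinery exists to tighten it.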
Note: Starting with Spark 1.6, when Spark calls SELECT SOME_AGG(DISTINCT foo), SOME_AGG(DISTINCT bar) FROM df, each clause triggers a separate aggregation, whereas SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df aggregates once. Thus the performance won't be comparable when using count(distinct(_)) versus approxCountDistinct (or approx_count_distinct).
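For comparison, the exact per-column distinct counts can be computed with countDistinct, subject to the performance caveat just described. This is a sketch, assuming the df defined earlier and a Spark session in scope:

```scala
import org.apache.spark.sql.functions.countDistinct

// Exact distinct count per column; each DISTINCT clause triggers its own
// aggregation, so this is more expensive than approx_count_distinct above.
val exactExprs = df.columns.map(c => countDistinct(c).as(s"distinct_$c"))
df.agg(exactExprs.head, exactExprs.tail: _*).show()
```

The agg(head, tail: _*) pattern is just the usual way to pass a dynamically built list of columns to a varargs Spark method.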
This is one of the behavior changes introduced in Spark 1.6:
With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true. (SPARK-12077)
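In Spark 1.6 that flag would be set through the SQLContext; a hypothetical usage sketch (the flag is internal and specific to that release era):

```scala
// Revert single-distinct-aggregate planning to the Spark 1.5 behavior
// (internal flag from the Spark 1.6 era; not present in later releases)
sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")
```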
Reference: Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles.