Spark DataFrame: count distinct values of every column

Problem description

The question is pretty much in the title: Is there an efficient way to count the distinct values in every column in a DataFrame?

The describe method provides only the count, not the distinct count, and I wonder if there is a way to get the distinct count for all (or some selected) columns.

Recommended answer

Multiple aggregations would be quite expensive to compute. I suggest that you use approximation methods instead. In this case, approximating the distinct count:

// Assumes an active SparkSession named `spark`; toDF on a Seq needs its implicits.
import spark.implicits._

val df = Seq((1,3,4),(1,2,3),(2,3,4),(2,3,5)).toDF("col1","col2","col3")

// Build a column-name -> aggregate-function map and apply it in a single agg call.
val exprs = df.columns.map((_ -> "approx_count_distinct")).toMap
df.agg(exprs).show()
// +---------------------------+---------------------------+---------------------------+
// |approx_count_distinct(col1)|approx_count_distinct(col2)|approx_count_distinct(col3)|
// +---------------------------+---------------------------+---------------------------+
// |                          2|                          2|                          3|
// +---------------------------+---------------------------+---------------------------+

The approx_count_distinct method relies on HyperLogLog.
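If a tighter estimate is needed, the maximum estimation error can be passed explicitly. A minimal sketch, assuming Spark 2.1+ where approx_count_distinct accepts a relative standard deviation argument (the default is 5%):

import org.apache.spark.sql.functions.approx_count_distinct

// Allow at most ~1% relative standard deviation instead of the default 5%.
df.agg(approx_count_distinct("col1", 0.01)).show()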

The HyperLogLog algorithm and its variant HyperLogLog++ (implemented in Spark) rely on the following clever observation.

If the numbers are spread uniformly across a range, then the count of distinct elements can be approximated from the largest number of leading zeros in the binary representation of the numbers.

For example, if we observe a number whose digits in binary form are of the form 0…(k times)…01…1, then we can estimate that there are on the order of 2^k elements in the set. This is a very crude estimate, but it can be refined to great precision with a sketching algorithm.
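The following toy Scala sketch illustrates that crude leading-zero estimate. It is not Spark's HyperLogLog++ implementation; the choice of hash function and the helper name are purely illustrative.

// A toy illustration of the leading-zero observation above, NOT Spark's
// HyperLogLog++ implementation.
import scala.util.hashing.MurmurHash3

def crudeDistinctEstimate[T](items: Seq[T]): Long = {
  // Hash each item so values are spread roughly uniformly over the Int range,
  // then track the maximum number of leading zeros seen in any hash.
  val maxLeadingZeros = items
    .map(x => Integer.numberOfLeadingZeros(MurmurHash3.stringHash(x.toString)))
    .max
  // A hash with k leading zeros suggests on the order of 2^k distinct elements.
  1L << maxLeadingZeros
}

// Very rough, order-of-magnitude only:
// crudeDistinctEstimate(Seq(1, 2, 3, 2, 1))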

A thorough explanation of the mechanics behind this algorithm can be found in the original paper.

Note: since Spark 1.6, when Spark calls SELECT SOME_AGG(DISTINCT foo), SOME_AGG(DISTINCT bar) FROM df, each clause triggers a separate aggregation. This differs from SELECT SOME_AGG(foo), SOME_AGG(bar) FROM df, where we aggregate once. As a result, the performance of count(distinct _) and approxCountDistinct (or approx_count_distinct) won't be comparable.
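As a rough sketch of the contrast, assuming the same df as above and Spark 2.1+ function names, the exact and approximate variants look like this:

import org.apache.spark.sql.functions.{countDistinct, approx_count_distinct}

// Exact distinct counts: each DISTINCT clause triggers its own aggregation.
df.agg(countDistinct("col1"), countDistinct("col2"), countDistinct("col3")).show()

// Approximate distinct counts: computed from HyperLogLog++ sketches in a single pass.
df.agg(
  approx_count_distinct("col1"),
  approx_count_distinct("col2"),
  approx_count_distinct("col3")
).show()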

This is one of the behavior changes since Spark 1.6:

With the improved query planner for queries having distinct aggregations (SPARK-9241), the plan of a query having a single distinct aggregation has been changed to a more robust version. To switch back to the plan generated by Spark 1.5's planner, please set spark.sql.specializeSingleDistinctAggPlanning to true. (SPARK-12077)
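For completeness, a hedged one-liner showing how that flag from the release note would be set on a Spark 1.6 SQLContext (the flag only exists in Spark 1.6, so this does not apply to later releases):

// Spark 1.6 only: revert to the Spark 1.5 planner for single distinct aggregations.
sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")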

Reference: Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles.
