In Spark, how to estimate the number of elements in a DataFrame quickly
Question
In Spark, is there a fast way to get an approximate count of the number of elements in a Dataset? That is, faster than Dataset.count() does.

Maybe we could calculate this information from the number of partitions of the Dataset, could we?
Answer
You could try countApprox on the RDD API. Although this also launches a Spark job, it should be faster, since it only gives you an estimate of the true count for a given amount of time you are willing to spend (in milliseconds) and a confidence level (i.e. the probability that the true value lies within the returned range):
Example usage:
val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.90)
val (lowCnt, highCnt) = (cntInterval.initialValue.low, cntInterval.initialValue.high)
You have to play a bit with the parameters timeout and confidence. The higher the timeout, the more accurate the estimated count.
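To make the snippet above self-contained, here is a minimal runnable sketch. It assumes a local Spark run for illustration; the DataFrame contents and the timeout/confidence values are placeholders you would tune for your own data and latency budget.

```scala
import org.apache.spark.sql.SparkSession

object ApproxCountExample {
  def main(args: Array[String]): Unit = {
    // Assumption: local[*] master, for illustration only.
    val spark = SparkSession.builder()
      .appName("approx-count")
      .master("local[*]")
      .getOrCreate()

    // Example data: a DataFrame with 10 million rows.
    val df = spark.range(0L, 10000000L).toDF("id")

    // countApprox returns a PartialResult[BoundedDouble];
    // initialValue holds the best estimate available once the
    // timeout (here 1000 ms) expires.
    val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.90)
    val bounded = cntInterval.initialValue
    println(s"estimate=${bounded.mean} low=${bounded.low} high=${bounded.high}")

    spark.stop()
  }
}
```

If the job happens to finish within the timeout, the interval collapses and low, mean, and high all equal the exact count; otherwise the width of [low, high] reflects how much of the data was actually counted in the time allowed.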