In Spark, how to estimate the number of elements in a DataFrame quickly


Problem description

In Spark, is there a fast way to get an approximate count of the number of elements in a Dataset? That is, faster than Dataset.count().

Maybe we could calculate this information from the number of partitions of the Dataset, could we?

Answer

You could try to use countApprox on the RDD API. Although this also launches a Spark job, it should be faster, as it just gives you an estimate of the true count for a given amount of time you are willing to spend (milliseconds) and a confidence level (i.e. the probability that the true value is within the returned range):

Example usage:

val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.90)
val (lowCnt, highCnt) = (cntInterval.initialValue.low, cntInterval.initialValue.high)

You have to play a bit with the parameters timeout and confidence. The higher the timeout, the more accurate the estimated count.
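To make the snippet above self-contained, here is a minimal runnable sketch. The object name, the local master setting, and the example DataFrame are illustrative assumptions, not part of the original question; only the countApprox call itself is from the answer above.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: build a local SparkSession, create a toy DataFrame,
// and ask for an approximate count within a time budget.
object ApproxCountExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("approx-count")
      .master("local[*]") // assumption: running locally for illustration
      .getOrCreate()

    import spark.implicits._
    val df = (1 to 1000000).toDF("id") // hypothetical example data

    // Spend at most 200 ms; request a 90% confidence interval.
    val cntInterval = df.rdd.countApprox(timeout = 200L, confidence = 0.90)
    val bounded = cntInterval.initialValue
    println(s"approximate count in [${bounded.low}, ${bounded.high}]")

    spark.stop()
  }
}
```

With a short timeout the interval can be wide (or based on partial results only); increasing the timeout tightens it until it converges on the exact count.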
