In Spark, how to estimate the number of elements in a DataFrame quickly
Question
In Spark, is there a fast way to get an approximate count of the number of elements in a Dataset? That is, faster than Dataset.count() does.

Maybe we could calculate this information from the number of partitions of the Dataset, could we?
Answer
You could try countApprox on the RDD API. Although this also launches a Spark job, it should be faster, since it only gives you an estimate of the true count for a given amount of time you are willing to spend (in milliseconds) and a confidence level (i.e. the probability that the true value lies within the returned range):
Example usage:
val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.90)
val (lowCnt, highCnt) = (cntInterval.initialValue.low, cntInterval.initialValue.high)
You have to play a bit with the parameters timeout and confidence. The higher the timeout, the more accurate the estimated count.
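To make the snippet above self-contained, here is a minimal runnable sketch. It assumes a local Spark run for illustration; the DataFrame contents and the timeout/confidence values are placeholders you would tune for your own data and latency budget.

```scala
import org.apache.spark.sql.SparkSession

object ApproxCountExample {
  def main(args: Array[String]): Unit = {
    // Assumption: local[*] master, for illustration only.
    val spark = SparkSession.builder()
      .appName("approx-count")
      .master("local[*]")
      .getOrCreate()

    // Example data: a DataFrame with 10 million rows.
    val df = spark.range(0L, 10000000L).toDF("id")

    // countApprox returns a PartialResult[BoundedDouble];
    // initialValue holds the best estimate available once the
    // timeout (here 1000 ms) expires.
    val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.90)
    val bounded = cntInterval.initialValue
    println(s"estimate=${bounded.mean} low=${bounded.low} high=${bounded.high}")

    spark.stop()
  }
}
```

If the job happens to finish within the timeout, the interval collapses and low, mean, and high all equal the exact count; otherwise the width of [low, high] reflects how much of the data was actually counted in the time allowed.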