How to calculate the number of rows of a dataframe efficiently?

Views: 21 | Published: 2021/11/14 23:31:40 | Tags: apache-spark, pyspark, apache-spark-sql

I have a very large PySpark DataFrame and I want to count its rows, but the count() method is too slow. Is there a faster way?

Solution

If you don't mind getting an approximate count, you could try sampling the dataset first and then scaling by your sampling factor:

>>> df = spark.range(10)
>>> df.sample(0.5).count()
4

In this case, you would scale the count() result by 2 (that is, 1/0.5), which gives an estimate of 8 rows for a DataFrame that actually has 10. As that gap shows, this approach carries statistical error.
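The sample-then-scale idea can be wrapped in a small helper. This is only a minimal sketch: the function name `estimate_row_count` and its parameters are made up for illustration, and it assumes `df` is a PySpark DataFrame (or anything exposing `sample(fraction=..., seed=...)` and `count()`).

```python
def estimate_row_count(df, fraction, seed=None):
    """Estimate the number of rows in `df` from a random sample.

    `estimate_row_count` is a hypothetical helper, not part of PySpark.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    # PySpark's DataFrame.sample expects the fraction as a float.
    fraction = float(fraction)
    # The sample is not guaranteed to contain exactly fraction * N rows,
    # which is where the statistical error comes from.
    sampled_rows = df.sample(fraction=fraction, seed=seed).count()
    # Scale the sampled count back up by 1 / fraction.
    return round(sampled_rows / fraction)


# Usage, assuming an active SparkSession named `spark`:
#   df = spark.range(10)
#   estimate_row_count(df, fraction=0.5)
```

With the sample shown above (4 rows at a fraction of 0.5) this returns 8. Assuming each row is kept independently with probability p, the relative standard error of the estimate is roughly sqrt((1 - p) / (p * N)) for N rows, so the estimate tends to be much tighter on a very large DataFrame than on this 10-row toy example.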
