How to calculate the number of rows of a dataframe efficiently?


Question

I have a very large pyspark dataframe and I would like to calculate the number of rows, but the count() method is too slow. Is there any faster method?

Answer

If you don't mind getting an approximate count, you could try sampling the dataset first and then scaling by your sampling factor:

>>> df = spark.range(10)
>>> df.sample(0.5).count()
4

In this case, you would scale the count() result by 2 (or 1/0.5). Obviously, there is a statistical error with this approach.
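The scale-up step can be illustrated in plain Python. This is a minimal sketch of the same Bernoulli-sampling idea that df.sample(fraction).count() uses; the approx_count helper is hypothetical, not part of any Spark API:

```python
import random

def approx_count(rows, fraction, seed=42):
    # Hypothetical helper: mimic df.sample(fraction).count() by keeping
    # each row independently with probability `fraction`, then scale the
    # sampled count back up by 1/fraction to estimate the true row count.
    rng = random.Random(seed)
    sampled = sum(1 for _ in rows if rng.random() < fraction)
    return sampled / fraction

# With a 1% sample of a million rows, the estimate is usually within
# a few percent of the true count, but it does carry sampling error.
estimate = approx_count(range(1_000_000), 0.01)
```

The standard deviation of such an estimate shrinks as the sample grows, so a larger fraction trades speed for accuracy.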
