Spark's dataframe count() function taking very long


Question

In my code, I have a sequence of dataframes from which I want to filter out the ones that are empty. I'm doing something like:

Seq(df1, df2).map(df => df.count() > 0)

However, this is taking extremely long, consuming around 7 minutes for roughly 2 dataframes of 100k rows each.

My question: why is Spark's implementation of count() so slow? Is there a workaround?

Recommended answer

count() itself is not the expensive part: a DataFrame is evaluated lazily, so it does not matter how big it is. But if many costly operations were applied to produce that dataframe, then the moment count() is called Spark actually executes all of those operations to materialize the data.
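To make the laziness concrete, here is a rough sketch (the file path, column names, and the spark session are placeholders, not taken from the question):

import org.apache.spark.sql.functions.col

// Nothing runs here: Spark only records the plan (read -> filter -> groupBy).
val grouped = spark.read.parquet("/data/events")
  .filter(col("status") === "ok")
  .groupBy(col("userId"))
  .count()

// count() is an action, so only at this point does Spark execute the whole plan,
// including the shuffle required by groupBy. This single call pays the full cost.
val n = grouped.count()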

Some of those costly operations are ones that require shuffling the data, such as groupBy, reduce, etc.

So my guess is that either you have some complex processing to build these dataframes, or the initial data used to build them is very large.
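As a possible workaround for the original filtering problem (a sketch, not part of the answer above): if the only goal is to drop empty dataframes, a full count() is unnecessary; fetching a single row is usually much cheaper, and caching helps when the same dataframe feeds several actions:

// Keep only the non-empty dataframes: head(1) asks Spark for at most one row,
// so it can stop as soon as any row is found instead of scanning everything.
val nonEmpty = Seq(df1, df2).filter(df => df.head(1).nonEmpty)

// If a dataframe is reused by several actions, cache it so its upstream
// transformations are not re-executed for every action.
val cached = df1.cache()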
