Spark's dataframe count() function taking very long


Problem Description

In my code, I have a sequence of dataframes from which I want to filter out the empty ones. I'm doing something like:

Seq(df1, df2).map(df => df.count() > 0)

However, this is taking extremely long: around 7 minutes for just 2 dataframes of roughly 100k rows each.

My question: why is Spark's count() implementation so slow? Is there a work-around?

Recommended Answer

Spark evaluates dataframes lazily, so it is not the size of the dataframe by itself that matters. count() is an action: if costly operations were queued up to produce the dataframe, then the moment count() is called, Spark actually executes all of those operations.

Some of those costly operations may be ones that require shuffling the data, such as groupBy, reduce, etc.
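
As an illustration, here is a minimal sketch of that behavior (the input file and column names are hypothetical, not taken from the question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("lazy-count-demo").getOrCreate()

val raw = spark.read.option("header", "true").csv("events.csv")  // lazy: only builds a plan
val perUser = raw.groupBy("userId").agg(count("*").as("events")) // shuffle boundary, still lazy
// Nothing has touched the data yet. This single action runs the whole plan:
val n = perUser.count() // reads the file, performs the shuffle, then counts the rows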

So my guess is that you either have some complex processing to produce these dataframes, or the initial data used to build them is very large.
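
As a possible work-around (not part of the original answer, and assuming you only need to know whether each dataframe is empty), testing for a single row is usually far cheaper than a full count, because Spark can stop as soon as it finds one:

// df1 and df2 are the dataframes from the question.
// head(1) returns an Array[Row] and lets Spark stop after the first row,
// instead of scanning every partition the way count() does.
val nonEmpty = Seq(df1, df2).filter(df => df.head(1).nonEmpty)

// On Spark 2.4+, the equivalent built-in is df.isEmpty:
// val nonEmpty = Seq(df1, df2).filterNot(_.isEmpty)

If the exact counts really are needed more than once, persisting the dataframe first (df.cache()) avoids re-running the whole lineage on every count.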
