Why is pyspark so much slower in finding the max of a column?


Question

Is there a general explanation for why Spark needs so much more time to calculate the maximum value of a column? I imported the Kaggle Quora training set (over 400,000 rows), and I like what Spark is doing when it comes to row-wise feature extraction. But now I want to scale a column 'manually': find the maximum value of the column and divide by that value. I tried the solutions from Best way to get the max value in a Spark dataframe column and https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html. I also tried df.toPandas() and then calculating the max in pandas (you guessed it, df.toPandas() took a long time). The only thing I did not try yet is the RDD way.
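For reference, a minimal sketch of the kind of approach from the linked answers (the DataFrame, column name, and sample values below are made up for illustration, not the actual Quora data):

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for the real training set
df = spark.createDataFrame([(1.0,), (7.5,), (3.2,)], ["feature"])

# Standard aggregation: one job that reduces the column to its maximum
col_max = df.agg(F.max("feature")).collect()[0][0]

# 'Manual' scaling: divide every value by that maximum
scaled = df.withColumn("feature_scaled", F.col("feature") / F.lit(col_max))
scaled.show()
```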

Before I provide some test code (I still have to find out how to generate dummy data in Spark; see the sketch after the list below), I'd like to know:

  • Can you point me to articles or discussions about this difference?
  • Is Spark more sensitive to the computer's memory limits than pandas?
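As mentioned above, here is a minimal sketch of one way dummy data could be generated in Spark (the row count, column name, and distribution are assumptions, chosen only to roughly match the size of the Quora set):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# ~400,000 rows with a random numeric column, roughly the size of the Quora training set
dummy = spark.range(400000).withColumn("feature", F.rand(seed=42) * 100)
dummy.show(5)
```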

Answer

@MaxU, @MattR, I found an intermediate solution that also made me reassess Spark's laziness and understand the problem better.

sc.accumulator helps me define a global variable, and with a separate AccumulatorParam object I can calculate the maximum of the column on the fly.
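A minimal sketch of what that can look like, assuming a hypothetical numeric column called "feature" (the data and names here are illustrative):

```python
from pyspark import AccumulatorParam
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Custom AccumulatorParam that keeps a running maximum instead of a sum
class MaxAccumulatorParam(AccumulatorParam):
    def zero(self, initial_value):
        return float("-inf")      # neutral element for max

    def addInPlace(self, v1, v2):
        return max(v1, v2)

max_acc = sc.accumulator(float("-inf"), MaxAccumulatorParam())

df = spark.createDataFrame([(1.0,), (7.5,), (3.2,)], ["feature"])

# Update the accumulator as a side effect while the rows are being processed anyway
def track_max(row):
    max_acc.add(row["feature"])
    return row

df.rdd.map(track_max).count()   # an action is needed before the accumulator is populated
print(max_acc.value)            # 7.5
```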

In testing this I noticed that Spark is even lazier than expected, so the part of my original post 'I like what spark is doing when it comes to rowwise feature extraction' boils down to 'I like that Spark is doing nothing quite fast'.
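To illustrate the laziness (the DataFrame and column names are again illustrative):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (7.5,), (3.2,)], ["feature"])

# Transformations return immediately; nothing has been computed yet
df2 = df.withColumn("feature_scaled", F.col("feature") / 100.0)

# Only an action triggers the actual work, and therefore the real cost
df2.count()
```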

On the other hand, a lot of the time spent on calculating the maximum of the column has presumably gone into computing the intermediate values.

Thanks for your input; this topic really got me much further in understanding Spark.
