Why is pyspark so much slower in finding the max of a column?


Question

Is there a general explanation for why Spark needs so much more time to calculate the maximum value of a column? I imported the Kaggle Quora training set (over 400,000 rows), and I like what Spark is doing when it comes to row-wise feature extraction. But now I want to scale a column 'manually': find the maximum value of the column and divide by that value. I tried the solutions from Best way to get the max value in a Spark dataframe column and https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html. I also tried df.toPandas() and then calculating the max in pandas (you guessed it, df.toPandas() took a long time). The only thing I did not try yet is the RDD way.
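For reference, a minimal sketch of the kind of approach from the linked answers (the DataFrame, column name, and sample values below are made up for illustration, not the actual Quora data):

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for the real training set
df = spark.createDataFrame([(1.0,), (7.5,), (3.2,)], ["feature"])

# Standard aggregation: one job that reduces the column to its maximum
col_max = df.agg(F.max("feature")).collect()[0][0]

# 'Manual' scaling: divide every value by that maximum
scaled = df.withColumn("feature_scaled", F.col("feature") / F.lit(col_max))
scaled.show()
```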

Before I provide some test code (I still have to find out how to generate dummy data in Spark; see the sketch after the list below), I'd like to know:

  • Can you point me to articles or discussions about this difference?
  • Is Spark more sensitive to the computer's memory limits than pandas?
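As mentioned above, here is a minimal sketch of one way dummy data could be generated in Spark (the row count, column name, and distribution are assumptions, chosen only to roughly match the size of the Quora set):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# ~400,000 rows with a random numeric column, roughly the size of the Quora training set
dummy = spark.range(400000).withColumn("feature", F.rand(seed=42) * 100)
dummy.show(5)
```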

Answer

@MaxU, @MattR, I found an intermediate solution that also made me reassess Spark's laziness and understand the problem better.

sc.accumulator helps me define a global variable, and with a separate AccumulatorParam object I can calculate the maximum of the column on the fly.
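A minimal sketch of what that can look like, assuming a hypothetical numeric column called "feature" (the data and names here are illustrative):

```python
from pyspark import AccumulatorParam
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Custom AccumulatorParam that keeps a running maximum instead of a sum
class MaxAccumulatorParam(AccumulatorParam):
    def zero(self, initial_value):
        return float("-inf")      # neutral element for max

    def addInPlace(self, v1, v2):
        return max(v1, v2)

max_acc = sc.accumulator(float("-inf"), MaxAccumulatorParam())

df = spark.createDataFrame([(1.0,), (7.5,), (3.2,)], ["feature"])

# Update the accumulator as a side effect while the rows are being processed anyway
def track_max(row):
    max_acc.add(row["feature"])
    return row

df.rdd.map(track_max).count()   # an action is needed before the accumulator is populated
print(max_acc.value)            # 7.5
```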

In testing this I noticed that Spark is even lazier than expected, so the part of my original post 'I like what spark is doing when it comes to rowwise feature extraction' boils down to 'I like that Spark is doing nothing quite fast'.
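To illustrate the laziness (the DataFrame and column names are again illustrative):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (7.5,), (3.2,)], ["feature"])

# Transformations return immediately; nothing has been computed yet
df2 = df.withColumn("feature_scaled", F.col("feature") / 100.0)

# Only an action triggers the actual work, and therefore the real cost
df2.count()
```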

On the other hand, a lot of the time spent on calculating the maximum of the column has presumably gone into computing the intermediate values.

Thanks for your input; this topic really got me much further in understanding Spark.
