Why is Apache-Spark - Python so slow locally as compared to pandas?

Views: 47 | Published: 2021/11/12 5:36:05 | Tags: python, pandas, apache-spark, pyspark, apache-spark-sql

Question

A Spark newbie here. I recently started playing around with Spark on my local machine on two cores by using the command:

pyspark --master local[2]

I have a 393 MB text file with almost a million rows, and I want to perform some data manipulation on it. I am using PySpark's built-in DataFrame functions for simple operations such as groupBy, sum, max, and stddev.
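To make the comparison concrete, here is a minimal sketch of the kind of aggregation described above, written once for pandas and once for PySpark. The question does not show its schema or code, so the column names `key` and `value`, the tiny inline dataset, and the helper names `pandas_agg`, `spark_agg`, and `timed` are all assumptions made for illustration.

```python
import time

import pandas as pd


def pandas_agg(pdf):
    """Group by `key` and compute sum, max and sample stddev of `value`."""
    return pdf.groupby("key")["value"].agg(["sum", "max", "std"])


def spark_agg(pdf, master="local[2]"):
    """Same aggregation in PySpark; needs pyspark and a JVM installed."""
    # Imported here so the pandas half of the sketch runs without Spark.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master(master).appName("agg-demo").getOrCreate()
    sdf = spark.createDataFrame(pdf)
    return (
        sdf.groupBy("key")
        .agg(
            F.sum("value").alias("sum"),
            F.max("value").alias("max"),
            F.stddev("value").alias("std"),  # sample stddev, like pandas
        )
        .collect()  # action: forces the lazy plan to actually execute
    )


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Tiny hypothetical dataset standing in for the 393 MB file.
pdf = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                    "value": [1.0, 2.0, 3.0, 4.0, 5.0]})

result, seconds = timed(pandas_agg, pdf)
print(result)
print(f"pandas: {seconds:.4f} s")
```

On a machine with Spark installed, `timed(spark_agg, pdf)` gives the PySpark time for the same input. Calling it twice is one rough way to separate session start-up from the cost of the query itself, since `getOrCreate()` reuses the existing session on the second call.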

However, when I run the exact same operations in pandas on the exact same dataset, pandas beats PySpark by a huge margin in terms of latency.

I was wondering what the reason for this could be. I have a couple of thoughts:

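Do the built-in functions perform serialization/deserialization inefficiently? If so, what are the alternatives to them?
Is the dataset too small to outweigh the overhead cost of the underlying JVM on which Spark runs?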

Thanks for looking. Much appreciated.
