Why is Apache Spark (Python) so slow locally compared to pandas?
Question
A Spark newbie here. I recently started playing around with Spark on my local machine, on two cores, using the command:
pyspark --master local[2]
I have a 393 MB text file with almost a million rows. I wanted to perform some data manipulation. I am using PySpark's built-in DataFrame functions to perform simple operations like `groupBy`, `sum`, `max`, and `stddev`.
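The pandas side of that pipeline might look like the sketch below; the data and column names are made up for illustration:

```python
import pandas as pd

# Toy data standing in for the ~1M-row file; the column names are hypothetical.
df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# The same groupBy / sum / max / stddev pipeline, expressed in pandas.
# pandas' std() uses the sample standard deviation (ddof=1), which matches
# Spark SQL's stddev.
stats = df.groupby("category")["value"].agg(["sum", "max", "std"])
print(stats)
```

Everything here runs in-process on in-memory arrays, which is relevant to the answer below.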
However, when I perform the exact same operations in pandas on the exact same dataset, pandas beats PySpark by a huge margin in terms of latency.
I was wondering what could be a possible reason for this. I have a couple of thoughts.
- Do the built-in functions make the serialization/deserialization process inefficient? If so, what are the alternatives?
- Is the dataset too small to outweigh the overhead cost of the underlying JVM on which Spark runs?
Thanks for looking. Much appreciated.
Accepted answer
Reasons:
- Apache Spark is a complex framework designed to distribute processing across hundreds of nodes while ensuring correctness and fault tolerance. Each of these properties has a significant cost.
- Because purely in-memory, in-core processing (pandas) is orders of magnitude faster than disk and network I/O, even local I/O (Spark).
- Because parallelism (and distributed processing) adds significant overhead, and even with an optimal (embarrassingly parallel) workload it does not guarantee any performance improvement.
- Because local mode is not designed for performance. It is meant for testing.
- Last but not least, 2 cores running on 393 MB is not enough to see any performance improvement, and a single node doesn't provide any opportunity for distribution.
- See also: Spark: Inconsistent performance number in scaling number of cores, Why is pyspark so much slower in finding the max of a column?, and Why does my Spark run slower than pure Python? Performance comparison.
You can go on like this for a long time...