Why is Apache Spark (Python) so slow locally compared to pandas?
Question
A Spark newbie here. I recently started playing around with Spark on my local machine, on two cores, using the command:
pyspark --master local[2]
I have a 393 MB text file with almost a million rows. I wanted to perform some data manipulation. I am using PySpark's built-in DataFrame functions to perform simple operations like `groupBy`, `sum`, `max`, and `stddev`.
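The pandas side of that pipeline might look like the sketch below; the data and column names are made up for illustration:

```python
import pandas as pd

# Toy data standing in for the ~1M-row file; the column names are hypothetical.
df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# The same groupBy / sum / max / stddev pipeline, expressed in pandas.
# pandas' std() uses the sample standard deviation (ddof=1), which matches
# Spark SQL's stddev.
stats = df.groupby("category")["value"].agg(["sum", "max", "std"])
print(stats)
```

Everything here runs in-process on in-memory arrays, which is relevant to the answer below.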
However, when I perform the exact same operations in pandas on the exact same dataset, pandas beats PySpark by a huge margin in terms of latency.
I was wondering what could be a possible reason for this. I have a couple of thoughts.
- Do the built-in functions make the serialization/deserialization process inefficient? If so, what are the alternatives?
- Is the dataset too small to outweigh the overhead cost of the underlying JVM on which Spark runs?
Thanks for looking. Much appreciated.
Accepted answer
Reasons:
- Apache Spark is a complex framework designed to distribute processing across hundreds of nodes while ensuring correctness and fault tolerance. Each of these properties has a significant cost.
- Because purely in-memory, in-core processing (pandas) is orders of magnitude faster than disk and network I/O, even local I/O (Spark).
- Because parallelism (and distributed processing) adds significant overhead, and even with an optimal (embarrassingly parallel) workload it does not guarantee any performance improvement.
- Because local mode is not designed for performance. It is meant for testing.
- Last but not least, 2 cores running on 393 MB is not enough to see any performance improvement, and a single node doesn't provide any opportunity for distribution.
- See also: Spark: Inconsistent performance number in scaling number of cores, Why is pyspark so much slower in finding the max of a column?, and Why does my Spark run slower than pure Python? Performance comparison.
You can go on like this for a long time...