Why is Apache-Spark - Python so slow locally as compared to pandas?


Problem description

A Spark newbie here. I recently started playing around with Spark on my local machine on two cores by using the command:

pyspark --master local[2]
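
For reference, the same two-core local session can also be created from a plain Python script instead of the pyspark shell. A minimal sketch (the app name is arbitrary):

    from pyspark.sql import SparkSession

    # Equivalent to `pyspark --master local[2]`: one local JVM, two worker threads.
    spark = SparkSession.builder \
        .master("local[2]") \
        .appName("pandas-comparison") \
        .getOrCreate()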

I have a 393MB text file which has almost a million rows. I wanted to perform some data manipulation operations. I am using the built-in dataframe functions of PySpark to perform simple operations like groupBy, sum, max, stddev.
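
To make the setup concrete, here is a sketch of the kind of aggregation described. The file path and the column names (category, value) are hypothetical, since the question does not give the schema; spark is the session from the shell or the sketch above:

    from pyspark.sql import functions as F

    # Hypothetical schema: a grouping column and a numeric column.
    df = spark.read.csv("data.txt", header=True, inferSchema=True)

    agg = (df.groupBy("category")
             .agg(F.sum("value").alias("total"),
                  F.max("value").alias("max_value"),
                  F.stddev("value").alias("value_std")))
    agg.show()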

However, when I do the exact same operations in pandas on the exact same dataset, pandas seems to defeat pyspark by a huge margin in terms of latency.
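
The pandas counterpart, using the same hypothetical file and columns, is a single in-memory group-by:

    import pandas as pd

    # Same hypothetical file and columns as the PySpark sketch above.
    pdf = pd.read_csv("data.txt")
    print(pdf.groupby("category")["value"].agg(["sum", "max", "std"]))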

I was wondering what could be a possible reason for this. I have a couple of thoughts.

  1. Do the built-in functions make the serialization/deserialization process inefficient? If so, what are the alternatives?
  2. Is the dataset too small to outweigh the overhead cost of the underlying JVM on which Spark runs?

Thanks for looking. Much appreciated.

Answer

Reasons:

  • Apache Spark is a complex framework designed to distribute processing across hundreds of nodes while ensuring correctness and fault tolerance. Each of these properties has a significant cost.
  • Because purely in-memory, in-core processing (pandas) is orders of magnitude faster than disk and network (even local) I/O (Spark).
  • Because parallelism (and distributed processing) adds significant overhead, and even an optimal (embarrassingly parallel) workload guarantees no performance improvement (a rough timing sketch follows this list).
  • Because local mode is not designed for performance; it is used for testing.
  • Last but not least, 2 cores running on 393MB is not enough to see any performance improvement, and a single node doesn't provide any opportunity for distribution.
  • See also: Spark: Inconsistent performance number in scaling number of cores; Why is pyspark so much slower in finding the max of a column?; Why does my Spark run slower than pure Python? Performance comparison.
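
One way to see the fixed overhead described above is to time the same aggregation both ways. This is a rough sketch, reusing the hypothetical df, pdf and F from the sketches in the question; absolute numbers depend on the machine:

    import time

    # The Spark job pays JVM task-scheduling and (de)serialization
    # costs on every action...
    t0 = time.time()
    df.groupBy("category").agg(F.sum("value")).collect()
    print("PySpark: %.2fs" % (time.time() - t0))

    # ...while pandas runs the whole aggregation in-process, in memory.
    t0 = time.time()
    pdf.groupby("category")["value"].sum()
    print("pandas:  %.2fs" % (time.time() - t0))

On a dataset of this size the pandas version typically wins; Spark's overhead only pays off once the data and the computation outgrow a single machine.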

You can go on like this for a long time...
