Why does my Spark run slower than pure Python? Performance comparison

Question

Spark newbie here. I tried to do some pandas-style actions on my data frame using Spark, and surprisingly it's slower than pure Python (i.e. using the pandas package in Python). Here's what I did:

1) In Spark:

train_df.filter(train_df.gender == '-unknown-').count()

It takes about 30 seconds to get results back. But using Python it takes about 1 second.
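
For comparison, a minimal pandas sketch of the same count (the DataFrame name train_pd and the CSV path are hypothetical; the column name gender comes from the question):

import pandas as pd

# Hypothetical load; train_pd mirrors the ~220,000-row train_df above
train_pd = pd.read_csv('train.csv')

# Same filter-and-count as the Spark line above
print((train_pd['gender'] == '-unknown-').sum())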

2) In Spark:

sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()

Same thing, takes about 30 sec in Spark, 1 sec in Python.
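
A pandas sketch of the same aggregation, again using the hypothetical train_pd from above:

# Equivalent of SELECT gender, count(*) FROM train GROUP BY gender
print(train_pd.groupby('gender').size())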

Several possible reasons my Spark is much slower than pure Python:

1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.

2) My Spark is running locally and I should run it on something like Amazon EC instead.

3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.

4) Spark is slow because I'm running Python. If I were using Scala it would be much better. (Counter-argument: I heard lots of people are using PySpark just fine.)

Which one of these is most likely the reason, or the most credible explanation? I would love to hear from some Spark experts. Thank you very much!!

Answer

Python will definitely perform better than PySpark on smaller data sets. You will see the difference when you are dealing with larger data sets.

By default, when you run Spark with a SQL context or Hive context, it will use 200 shuffle partitions. You need to change it to 10 or whatever value suits your data by using sqlContext.sql("set spark.sql.shuffle.partitions=10");. It will definitely be faster than with the default.
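
A short sketch of how that setting could be applied before re-running the aggregation from the question (sqlContext follows the question's naming; the commented spark.conf.set line assumes a Spark 2.x+ SparkSession named spark, which is not part of the original answer):

# Lower the shuffle partition count: the default of 200 creates far more
# tasks than a 24 MB local data set needs
sqlContext.sql("set spark.sql.shuffle.partitions=10")
sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()

# Equivalent on a Spark 2.x+ SparkSession:
# spark.conf.set("spark.sql.shuffle.partitions", "10")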

1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.

You are right, you will not see much difference at lower volumes. Spark can be slower as well.

2) My Spark is running locally and I should run it on something like Amazon EC instead.

For your data volume it might not help much.

3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.

Again, it does not matter for a 24 MB data set.

4) Spark is slow because I'm running Python. If I were using Scala it would be much better. (Counter-argument: I heard lots of people are using PySpark just fine.)

On a standalone machine there will be a difference. Python has more runtime overhead than Scala, but on a larger cluster with distributed capability it need not matter.
