Why does my Spark run slower than pure Python? Performance comparison

Question

Spark newbie here. I tried to do some pandas-style operations on my data frame using Spark, and surprisingly it's slower than pure Python (i.e. using the pandas package in Python). Here's what I did:

1) In Spark:

train_df.filter(train_df.gender == '-unknown-').count()

It takes about 30 seconds to get results back. But using Python it takes about 1 second.
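
For reference, a rough pandas equivalent of that count, as a sketch only: the file name train.csv and the variable train_pdf are assumptions, since the question doesn't show how the pandas side was loaded.

import pandas as pd

# Hypothetical load of the same ~24 MB dataset into a pandas DataFrame.
train_pdf = pd.read_csv("train.csv")

# Count rows where gender is '-unknown-'; this runs in a single process,
# with none of Spark's job-scheduling overhead.
print((train_pdf.gender == '-unknown-').sum())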

2) In Spark:

sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()

Same thing, takes about 30 sec in Spark, 1 sec in Python.
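
The pandas counterpart of that GROUP BY is a single in-memory call; again a sketch, reusing the hypothetical train.csv from above:

import pandas as pd

train_pdf = pd.read_csv("train.csv")  # same hypothetical file as above

# Per-gender row counts, equivalent to the SQL GROUP BY query.
print(train_pdf.groupby('gender').size())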

Several possible reasons my Spark is much slower than pure Python:

1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.

2) My Spark is running locally and I should run it on something like Amazon EC2 instead.

3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.

4) Spark is slow because I'm running Python. If I were using Scala it would be much better. (Counter-argument: I've heard lots of people are using PySpark just fine.)

Which one of these is most likely the reason, or the most credible explanation? I would love to hear from some Spark experts. Thank you very much!!

Recommended answer

Python (pandas) will definitely perform better than PySpark on smaller data sets. You will see the difference when you are dealing with larger data sets.

By default, when you run Spark with a SQLContext or HiveContext, it uses 200 shuffle partitions. You need to change that to 10, or whatever value fits your data, using sqlContext.sql("set spark.sql.shuffle.partitions=10"). It will definitely be faster than with the default.
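
A minimal sketch of applying that setting, assuming the sqlContext and the registered train table from the question:

# The default of 200 shuffle partitions means 200 tiny tasks for a
# 24 MB dataset, so scheduling overhead dominates the actual work.
sqlContext.sql("set spark.sql.shuffle.partitions=10")

# Equivalent programmatic form on a Spark 1.x SQLContext:
# sqlContext.setConf("spark.sql.shuffle.partitions", "10")

# Re-run the aggregation; it should now launch far fewer tasks.
sqlContext.sql("SELECT gender, count(*) FROM train GROUP BY gender").show()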

1) My dataset is about 220,000 records, 24 MB, and that's not a big enough dataset to show the scaling advantages of Spark.

You are right, you will not see much difference at lower volumes. Spark can be slower as well.

2) My Spark is running locally and I should run it on something like Amazon EC2 instead.

For your volume it might not help much.

3) Running locally is okay, but my computing capacity just doesn't cut it. It's an 8 GB RAM 2015 MacBook.

Again, it does not matter for a 24 MB data set.

4) Spark is slow because I'm running Python. If I were using Scala it would be much better. (Counter-argument: I've heard lots of people are using PySpark just fine.)

On a standalone setup there will be a difference. Python has more runtime overhead than Scala, but on a larger cluster with distributed execution it need not matter.
