Apache Spark slow on reduceByKey step


Problem description

I have a 2MB plain text file in /usr/local/share/data/. Then I run the following code against it in Apache Spark.

from pyspark import SparkConf, SparkContext

# word_tokenize, word_pos_tagging and filter_punctuation are the asker's
# helpers (not shown in the question; presumably NLTK-based).
conf = SparkConf().setMaster("local[*]").setAppName("test").set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)
doc_rdd = sc.textFile("/usr/local/share/data/")
unigrams = doc_rdd.flatMap(word_tokenize)                 # line -> tokens
step1 = unigrams.flatMap(word_pos_tagging)                # token -> (token, POS tag)
step2 = step1.filter(lambda x: filter_punctuation(x[0]))  # drop punctuation tokens
step3 = step2.map(lambda x: (x, 1))                       # ((word, tag), 1) pairs
freq_unigrams = step3.reduceByKey(lambda x, y: x + y)     # sum counts per (word, tag)
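
For reference, the question does not include the three helper functions. Below is a minimal sketch of what they might look like, assuming NLTK; word_pos_tagging and filter_punctuation as written here are illustrative assumptions, not the asker's actual code.

import string
from nltk import pos_tag                  # assumed: NLTK's POS tagger
from nltk.tokenize import word_tokenize   # assumed: NLTK's standard tokenizer

def word_pos_tagging(token):
    # Assumed helper: pos_tag expects a list of tokens and returns a list of
    # (token, tag) pairs, which the flatMap above then flattens.
    return pos_tag([token])

def filter_punctuation(token):
    # Assumed helper: keep tokens with at least one non-punctuation character.
    return any(ch not in string.punctuation for ch in token)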

Expected result

[((u'showing', 'VBG'), 24), ((u'Ave', 'NNP'), 1), ((u'Scrilla364', 'NNP'), 1), ((u'internally', 'RB'), 4), ...]

But it takes a very long time (6 minutes) to return the expected word counts. It gets stuck at the reduceByKey step. How can I resolve this performance issue?

Reference

Hardware specification

Model Name: MacBook Air
Model Identifier: MacBookAir4,2
Processor Name: Intel Core i7
Processor Speed: 1.8 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Memory: 4 GB

Log

15/10/02 16:05:12 INFO HadoopRDD: Input split: file:/usr/local/share/data/enronsent01:0+873602
15/10/02 16:05:12 INFO HadoopRDD: Input split: file:/usr/local/share/data/enronsent01:873602+873602
15/10/02 16:09:11 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:53478 in memory (size: 4.1 KB, free: 530.0 MB)
15/10/02 16:09:11 INFO BlockManagerInfo: Removed broadcast_3_piece0 on localhost:53478 in memory (size: 4.6 KB, free: 530.0 MB)
15/10/02 16:09:11 INFO ContextCleaner: Cleaned accumulator 4
15/10/02 16:09:11 INFO ContextCleaner: Cleaned accumulator 3
15/10/02 16:09:11 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:53478 in memory (size: 3.9 KB, free: 530.0 MB)
15/10/02 16:09:11 INFO ContextCleaner: Cleaned accumulator 2
15/10/02 16:10:05 INFO PythonRDD: Times: total = 292892, boot = 8, init = 275, finish = 292609
15/10/02 16:10:05 INFO Executor: Finished task 1.0 in stage 3.0 (TID 4). 2373 bytes result sent to driver
15/10/02 16:10:05 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID 4) in 292956 ms on localhost (1/2)
15/10/02 16:10:35 INFO PythonRDD: Times: total = 322562, boot = 5, init = 276, finish = 322281
15/10/02 16:10:35 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 2373 bytes result sent to driver
15/10/02 16:10:35 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 322591 ms on localhost (2/2)

Recommended answer

The code looks fine.

You can try a few options to improve performance.

SparkConf().setMaster("local[*]").setAppName("test").set("spark.executor.memory", "2g")

local -> local[*]: if the work can be split into tasks, Spark will use all the cores available on the machine.
And, if possible, increase the memory available to the program.
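
A related caveat: in local mode there is no separate executor process, so spark.executor.memory has essentially no effect; all the work happens in the driver JVM, whose heap must be sized before it starts. A minimal sketch of how one might give the job more memory (the 4g figure and script name are placeholders, not tested recommendations):

# Set driver memory at launch time; assigning spark.driver.memory inside
# SparkConf would be too late, since the driver JVM is already running:
#
#   spark-submit --master "local[*]" --driver-memory 4g word_count.py
#
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("test")  # use all cores
sc = SparkContext(conf=conf)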

P.S. And to really appreciate Spark, you should have a good amount of data, so that you can run it on a cluster.

