Spark: sc.WholeTextFiles takes a long time to execute


Question

I have a cluster on which I execute wholeTextFiles, which should pull in about a million text files summing up to approximately 10 GB in total. I have one NameNode and two DataNodes with 30 GB of RAM each, 4 cores each. The data is stored in HDFS.

I don't pass any special parameters and the job takes 5 hours just to read the data. Is that expected? Are there any parameters that should speed up the read (Spark configuration, partitioning, number of executors)?

I'm just starting out and I've never needed to optimize a job before.

EDIT: Additionally, can someone explain exactly how the wholeTextFiles function works? (Not how to use it, but how it was programmed.) I'm very interested in understanding the partition parameter, etc.
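For reference, the call itself looks roughly like this; a minimal sketch with a placeholder path, where the second argument is only a lower-bound hint for the number of partitions, not an exact count:

    // wholeTextFiles returns an RDD of (filePath, fileContent) pairs.
    // The path below is a placeholder; the second argument is minPartitions,
    // a hint for how many partitions to create at read time.
    val files = sc.wholeTextFiles("hdfs:///data/small-files/", 64)
    files.map { case (path, content) => (path, content.length) }.take(5)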

EDIT 2: Benchmarking

So I tried repartitioning after the wholeTextFiles call, but the problem is the same, because the initial read still uses the pre-defined number of partitions, so there is no performance improvement. Once the data is loaded the cluster performs really well... I get the following warning message when dealing with the data (for 200k files), on the wholeTextFiles call:

15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.

Would that be a reason for the bad performance? How do I work around it?

Additionally, when doing a saveAsTextFile, my speed according to the Ambari console is 19 MB/s. When doing a read with wholeTextFiles, I am at 300 KB/s...

It seems that by increasing the number of partitions in wholeTextFiles(path, partitions), I am getting better performance. But still only 8 tasks are running at the same time (my number of CPUs). I'm benchmarking to find the limit...
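For what it's worth, the number of partitions actually produced can be checked directly; a small sketch with an illustrative partition hint and a placeholder path:

    // The resulting partition count may differ from the hint, since it is only a minimum.
    val rdd = sc.wholeTextFiles("hdfs:///data/small-files/", 32)
    println(rdd.partitions.size)
    // Only as many tasks run at once as there are executor cores available,
    // so 8 cores cap concurrency at 8 tasks no matter how many partitions exist.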

Answer

To summarize my recommendations from the comments:


  1. HDFS is not a good fit for storing many small files. First of all, the NameNode stores metadata in memory, so the number of files and blocks you can have is limited (~100 million blocks is the maximum for a typical server). Next, each time you read a file you first query the NameNode for the block locations and then connect to the DataNode storing the file. The overhead of these connections and responses is really huge.
  2. Default settings should always be reviewed. By default, Spark starts on YARN with 2 executors (--num-executors) with 1 thread each (--executor-cores) and 512 MB of RAM (--executor-memory), giving you only 2 threads with 512 MB RAM each, which is really small for real-world tasks.

So my proposal is:


  1. Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4, which would give you more parallelism: 16 threads in this particular case, which means 16 tasks running in parallel.
  2. Use sc.wholeTextFiles to read the files and then dump them into a compressed SequenceFile (for instance, with Snappy block-level compression); here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/. This will greatly reduce the time needed to read them on the next iteration, as sketched below.
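A minimal sketch of the second recommendation, assuming Scala and placeholder HDFS paths (it follows the approach described in the linked post, not its exact code):

    import org.apache.spark.SparkContext._         // implicits needed on older Spark versions
    import org.apache.hadoop.io.compress.SnappyCodec

    // Read the small files once, then persist them as one Snappy-compressed
    // SequenceFile so later jobs avoid the per-file NameNode/DataNode round trips.
    val files = sc.wholeTextFiles("hdfs:///data/small-files/", 16)
    files.saveAsSequenceFile("hdfs:///data/packed-seqfile", Some(classOf[SnappyCodec]))

    // Subsequent iterations read the packed file instead of a million small ones:
    val packed = sc.sequenceFile[String, String]("hdfs:///data/packed-seqfile")

    // Depending on your Hadoop configuration you may also need to switch the
    // sequence-file compression type to BLOCK to get block-level compression.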
