Spark: sc.WholeTextFiles takes a long time to execute


Question

I have a cluster on which I execute wholeTextFiles, which should pull in about a million text files summing up to approximately 10 GB in total. I have one NameNode and two DataNodes with 30 GB of RAM each, 4 cores each. The data is stored in HDFS.

I don't pass any special parameters and the job takes 5 hours just to read the data. Is that expected? Are there any parameters that should speed up the read (Spark configuration, partitioning, number of executors)?

I'm just starting out and I've never needed to optimize a job before.

EDIT: Additionally, can someone explain exactly how the wholeTextFiles function works? (Not how to use it, but how it was programmed.) I'm very interested in understanding the partition parameter, etc.
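For reference, the call itself looks roughly like this; a minimal sketch with a placeholder path, where the second argument is only a lower-bound hint for the number of partitions, not an exact count:

    // wholeTextFiles returns an RDD of (filePath, fileContent) pairs.
    // The path below is a placeholder; the second argument is minPartitions,
    // a hint for how many partitions to create at read time.
    val files = sc.wholeTextFiles("hdfs:///data/small-files/", 64)
    files.map { case (path, content) => (path, content.length) }.take(5)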

EDIT 2: Benchmarking

So I tried repartitioning after the wholeTextFiles call, but the problem is the same, because the initial read still uses the pre-defined number of partitions, so there is no performance improvement. Once the data is loaded the cluster performs really well... I get the following warning message when dealing with the data (for 200k files), on the wholeTextFiles call:

15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.

Would that be a reason for the bad performance? How do I work around it?

Additionally, when doing a saveAsTextFile, my speed according to the Ambari console is 19 MB/s. When doing a read with wholeTextFiles, I am at 300 KB/s...

It seems that by increasing the number of partitions in wholeTextFiles(path, partitions), I am getting better performance. But still only 8 tasks are running at the same time (my number of CPUs). I'm benchmarking to find the limit...
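For what it's worth, the number of partitions actually produced can be checked directly; a small sketch with an illustrative partition hint and a placeholder path:

    // The resulting partition count may differ from the hint, since it is only a minimum.
    val rdd = sc.wholeTextFiles("hdfs:///data/small-files/", 32)
    println(rdd.partitions.size)
    // Only as many tasks run at once as there are executor cores available,
    // so 8 cores cap concurrency at 8 tasks no matter how many partitions exist.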

Answer

To summarize my recommendations from the comments:


  1. HDFS is not a good fit for storing many small files. First of all, the NameNode stores metadata in memory, so the number of files and blocks you can have is limited (~100 million blocks is the maximum for a typical server). Next, each time you read a file you first query the NameNode for the block locations and then connect to the DataNode storing the file. The overhead of these connections and responses is really huge.
  2. Default settings should always be reviewed. By default, Spark starts on YARN with 2 executors (--num-executors) with 1 thread each (--executor-cores) and 512 MB of RAM (--executor-memory), giving you only 2 threads with 512 MB RAM each, which is really small for real-world tasks.

So my proposal is:


  1. Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4, which would give you more parallelism: 16 threads in this particular case, which means 16 tasks running in parallel.
  2. Use sc.wholeTextFiles to read the files and then dump them into a compressed SequenceFile (for instance, with Snappy block-level compression); here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/. This will greatly reduce the time needed to read them on the next iteration, as sketched below.
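A minimal sketch of the second recommendation, assuming Scala and placeholder HDFS paths (it follows the approach described in the linked post, not its exact code):

    import org.apache.spark.SparkContext._         // implicits needed on older Spark versions
    import org.apache.hadoop.io.compress.SnappyCodec

    // Read the small files once, then persist them as one Snappy-compressed
    // SequenceFile so later jobs avoid the per-file NameNode/DataNode round trips.
    val files = sc.wholeTextFiles("hdfs:///data/small-files/", 16)
    files.saveAsSequenceFile("hdfs:///data/packed-seqfile", Some(classOf[SnappyCodec]))

    // Subsequent iterations read the packed file instead of a million small ones:
    val packed = sc.sequenceFile[String, String]("hdfs:///data/packed-seqfile")

    // Depending on your Hadoop configuration you may also need to switch the
    // sequence-file compression type to BLOCK to get block-level compression.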
