Spark: sc.wholeTextFiles takes a long time to execute


Question

I have a cluster on which I execute wholeTextFiles, which should pull in about a million text files summing up to roughly 10 GB in total. I have one NameNode and two DataNodes with 30 GB of RAM each, 4 cores each. The data is stored in HDFS.

I don't pass any special parameters and the job takes 5 hours just to read the data. Is that expected? Are there any parameters that should speed up the read (Spark configuration, partitioning, number of executors)?

I'm just starting out and I've never had the need to optimize a job before.

Additionally, can someone explain exactly how the wholeTextFiles function works? (Not how to use it, but how it was programmed.) I'm very interested in understanding the partition parameter, etc.
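For reference, a minimal usage sketch in Scala (the HDFS path and the partition count are placeholders, not from the original post): wholeTextFiles returns an RDD of (filePath, fileContent) pairs, and the optional second argument, minPartitions, is only a lower-bound hint for how many partitions the read is split into.

    // Each record is (fullFilePath, fileContentAsString), one pair per input file.
    val files: org.apache.spark.rdd.RDD[(String, String)] =
      sc.wholeTextFiles("hdfs:///data/small-files", 32) // 32 is the minPartitions hint

    // The actual partition count is derived from the total input size by the
    // underlying combine-file input format, using the hint as a suggested minimum.
    println(files.partitions.length)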

Edit 2: Benchmark evaluation

So I tried repartitioning after the wholeTextFiles call; the problem is the same, because the first read still uses the pre-defined number of partitions, so there is no performance improvement. Once the data is loaded the cluster performs really well... I get the following warning message when dealing with the data (for 200k files), on the wholeTextFiles stage:

15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.

Would that be a reason for the bad performance? How do I mitigate it?

Additionally, when doing a saveAsTextFile, my speed according to the Ambari console is 19 MB/s. When doing a read with wholeTextFiles, I am at 300 KB/s...

It seems that by increasing the number of partitions in wholeTextFiles(path, minPartitions), I am getting better performance. But still only 8 tasks are running at the same time (my number of CPU cores). I'm benchmarking to observe the limit...
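A sketch of the two knobs being compared in this benchmark (path and numbers are placeholders): the minPartitions hint influences how many tasks the initial read is split into, whereas repartition only redistributes data for the stages after the read.

    // Ask for more read partitions up front so more read tasks can run in parallel.
    val raw = sc.wholeTextFiles("hdfs:///data/small-files", 64)

    // repartition only affects downstream stages; the read stage itself still
    // runs with the partitioning chosen by wholeTextFiles above.
    val wider = raw.repartition(64)
    wider.count() // force the read in order to measure it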

Recommended answer

To summarize my recommendations from the comments:

  1. HDFS is not a good fit for storing many small files. First of all, the NameNode stores metadata in memory, so the number of files and blocks you might have is limited (~100m blocks is the maximum for a typical server). Next, each time you read a file you first query the NameNode for the block locations and then connect to the DataNode storing the file. The overhead of these connections and responses is really huge.
  2. Default settings should always be reviewed. By default Spark starts on YARN with 2 executors (--num-executors), 1 thread each (--executor-cores) and 512 MB of RAM each (--executor-memory), giving you only 2 threads with 512 MB of RAM each, which is really small for real-world tasks (a quick way to check the effective values is sketched below).
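As a quick way to verify point 2 (a sketch; the keys are standard Spark configuration properties), you can print the executor settings the driver actually resolved:

    // Print the executor settings in effect for this SparkContext. If they come
    // back unset or at the small defaults above, the cluster is barely being used.
    for (key <- Seq("spark.executor.instances", "spark.executor.cores", "spark.executor.memory")) {
      println(s"$key = ${sc.getConf.getOption(key).getOrElse("<not set, using default>")}")
    }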

So my recommendations are:

  1. Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4, which would give you more parallelism - 16 threads in this particular case, which means 16 tasks running in parallel.
  2. Use sc.wholeTextFiles to read the files and then dump them into a compressed sequence file (for instance, with Snappy block-level compression); here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/ (a sketch of the idea follows this list). This will greatly reduce the time needed to read them on the next iteration.
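A minimal sketch of recommendation 2 (paths are placeholders, and this follows the general idea rather than the exact code from the linked post): read the small files once, pack them into Snappy-compressed SequenceFiles, and read the packed form in later jobs.

    import org.apache.hadoop.io.compress.SnappyCodec

    // Read the ~1M small files once; each record is (filePath, fileContent).
    val files = sc.wholeTextFiles("hdfs:///input/small-files", 32)

    // Dump them into a few large, Snappy-compressed SequenceFiles so that
    // subsequent jobs read big splittable files instead of many tiny ones.
    files.saveAsSequenceFile("hdfs:///output/packed", Some(classOf[SnappyCodec]))

    // Later iterations read the packed data back much faster.
    val packed = sc.sequenceFile[String, String]("hdfs:///output/packed")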
