Spark: sc.wholeTextFiles takes a long time to execute


Question

I have a cluster on which I execute wholeTextFiles, which should pull in about a million text files summing up to roughly 10 GB in total. I have one NameNode and two DataNodes with 30 GB of RAM each, 4 cores each. The data is stored in HDFS.

I don't pass any special parameters and the job takes 5 hours just to read the data. Is that expected? Are there any parameters that should speed up the read (Spark configuration, partitioning, number of executors)?

I'm just starting out and I've never had the need to optimize a job before.

Additionally, can someone explain exactly how the wholeTextFiles function works? (Not how to use it, but how it was programmed.) I'm very interested in understanding the partition parameter, etc.
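For reference, a minimal usage sketch in Scala (the HDFS path and the partition count are placeholders, not from the original post): wholeTextFiles returns an RDD of (filePath, fileContent) pairs, and the optional second argument, minPartitions, is only a lower-bound hint for how many partitions the read is split into.

    // Each record is (fullFilePath, fileContentAsString), one pair per input file.
    val files: org.apache.spark.rdd.RDD[(String, String)] =
      sc.wholeTextFiles("hdfs:///data/small-files", 32) // 32 is the minPartitions hint

    // The actual partition count is derived from the total input size by the
    // underlying combine-file input format, using the hint as a suggested minimum.
    println(files.partitions.length)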

Edit 2: Benchmark evaluation

So I tried repartitioning after the wholeTextFiles call; the problem is the same, because the first read still uses the pre-defined number of partitions, so there is no performance improvement. Once the data is loaded the cluster performs really well... I get the following warning message when dealing with the data (for 200k files), on the wholeTextFiles stage:

15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.

Would that be a reason for the bad performance? How do I mitigate it?

Additionally, when doing a saveAsTextFile, my speed according to the Ambari console is 19 MB/s. When doing a read with wholeTextFiles, I am at 300 KB/s...

It seems that by increasing the number of partitions in wholeTextFiles(path, minPartitions), I am getting better performance. But still only 8 tasks are running at the same time (my number of CPU cores). I'm benchmarking to observe the limit...
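A sketch of the two knobs being compared in this benchmark (path and numbers are placeholders): the minPartitions hint influences how many tasks the initial read is split into, whereas repartition only redistributes data for the stages after the read.

    // Ask for more read partitions up front so more read tasks can run in parallel.
    val raw = sc.wholeTextFiles("hdfs:///data/small-files", 64)

    // repartition only affects downstream stages; the read stage itself still
    // runs with the partitioning chosen by wholeTextFiles above.
    val wider = raw.repartition(64)
    wider.count() // force the read in order to measure it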

Recommended answer

To summarize my recommendations from the comments:

  1. HDFS is not a good fit for storing many small files. First of all, the NameNode stores metadata in memory, so the number of files and blocks you might have is limited (~100m blocks is the maximum for a typical server). Next, each time you read a file you first query the NameNode for the block locations and then connect to the DataNode storing the file. The overhead of these connections and responses is really huge.
  2. Default settings should always be reviewed. By default Spark starts on YARN with 2 executors (--num-executors), 1 thread each (--executor-cores) and 512 MB of RAM each (--executor-memory), giving you only 2 threads with 512 MB of RAM each, which is really small for real-world tasks (a quick way to check the effective values is sketched below).
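As a quick way to verify point 2 (a sketch; the keys are standard Spark configuration properties), you can print the executor settings the driver actually resolved:

    // Print the executor settings in effect for this SparkContext. If they come
    // back unset or at the small defaults above, the cluster is barely being used.
    for (key <- Seq("spark.executor.instances", "spark.executor.cores", "spark.executor.memory")) {
      println(s"$key = ${sc.getConf.getOption(key).getOrElse("<not set, using default>")}")
    }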

So my recommendations are:

  1. Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4, which would give you more parallelism - 16 threads in this particular case, which means 16 tasks running in parallel.
  2. Use sc.wholeTextFiles to read the files and then dump them into a compressed sequence file (for instance, with Snappy block-level compression); here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/ (a sketch of the idea follows this list). This will greatly reduce the time needed to read them on the next iteration.
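A minimal sketch of recommendation 2 (paths are placeholders, and this follows the general idea rather than the exact code from the linked post): read the small files once, pack them into Snappy-compressed SequenceFiles, and read the packed form in later jobs.

    import org.apache.hadoop.io.compress.SnappyCodec

    // Read the ~1M small files once; each record is (filePath, fileContent).
    val files = sc.wholeTextFiles("hdfs:///input/small-files", 32)

    // Dump them into a few large, Snappy-compressed SequenceFiles so that
    // subsequent jobs read big splittable files instead of many tiny ones.
    files.saveAsSequenceFile("hdfs:///output/packed", Some(classOf[SnappyCodec]))

    // Later iterations read the packed data back much faster.
    val packed = sc.sequenceFile[String, String]("hdfs:///output/packed")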
