What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

Problem description

I'm deploying a Spark data processing job on an EC2 cluster. The job is small for the cluster (16 cores and 120 GB of RAM in total), and the largest RDD has only 76k+ rows, but the data is heavily skewed in the middle (and therefore needs repartitioning), and each row is around 100 KB after serialization. The job always gets stuck at the repartitioning step; that is, it constantly hits the following errors and retries:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle

org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer

org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: /tmp/spark-...

I've tried to identify the problem, but both memory and disk consumption on the machines throwing these errors appear to be below 50%. I've also tried different configurations (a rough sketch of the corresponding settings follows the list below), including:

let driver/executor memory use 60% of total memory
let Netty prioritize the JVM shuffle buffer
increase the shuffle streaming buffer to 128m
use KryoSerializer and max out all buffers
increase the shuffle memoryFraction to 0.4
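
For concreteness, the sketch below shows one way these settings might be expressed in Scala, assuming Spark 1.x-era property names; the exact values and the mapping from each item above to a property (in particular reading the "shuffle streaming buffer" as spark.reducer.maxMbInFlight) are assumptions, not something the question specifies.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("repartition-job")
  // spark.driver.memory normally has to be set before the driver JVM starts
  // (e.g. via spark-submit); it is shown here only for completeness.
  .set("spark.driver.memory", "7g")                  // ~60% of the driver node's RAM (node size assumed)
  .set("spark.executor.memory", "7g")                // ~60% of an executor node's RAM (node size assumed)
  .set("spark.shuffle.io.preferDirectBufs", "false") // have Netty use JVM heap buffers instead of off-heap ones
  .set("spark.reducer.maxMbInFlight", "128")         // assumed to be the "shuffle streaming buffer"
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max.mb", "512")  // max out the Kryo buffer
  .set("spark.shuffle.memoryFraction", "0.4")
val sc = new SparkContext(conf)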

But none of them works. The small job always triggers the same series of errors and maxes out its retries (up to 1000 times). How can I troubleshoot this kind of situation?

Thanks a lot.

Recommended answer

Check your log for an error similar to this:

ERROR 2015-05-12 17:29:16,984 Logging.scala:75 - Lost executor 13 on node-xzy: remote Akka client disassociated

Every time you get this error it is because you lost an executor. As for why you lost the executor, that is another story; again, check your log for clues.

For one thing, YARN can kill your job if it thinks you are using "too much memory".

Check for something like this:

org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl  - Container [<edited>] is running beyond physical memory limits. Current usage: 18.0 GB of 18 GB physical memory used; 19.4 GB of 37.8 GB virtual memory used. Killing container.

See also:

The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory requested, but it will still just be a heuristic.
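
In Scala that amounts to something like the minimal sketch below, assuming a YARN deployment; the 2048 MB value is an arbitrary starting point, not a recommendation, and in newer Spark releases the property was renamed to spark.executor.memoryOverhead.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "2048") // extra off-heap headroom per executor, in MB

Raising this gives each executor more headroom before YARN's physical-memory check (the "running beyond physical memory limits" message above) kills the container.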
