What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?
Question
I'm deploying a Spark data processing job on an EC2 cluster. The job is small for the cluster (16 cores and 120G of RAM in total), and the largest RDD has only 76k+ rows, but it is heavily skewed in the middle (thus requires repartitioning) and each row holds around 100k of data after serialization. The job always gets stuck in repartitioning; namely, it constantly hits the following errors and retries:
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer
org.apache.spark.shuffle.FetchFailedException: java.io.FileNotFoundException: /tmp/spark-...
I've tried to identify the problem, but both the memory and disk consumption of the machines throwing these errors appear to be below 50%. I've also tried different configurations, including:
letting driver/executor memory use 60% of total memory;
letting Netty prioritize the JVM shuffle buffer;
increasing the shuffle streaming buffer to 128m;
using KryoSerializer and maxing out all its buffers;
increasing the shuffle memoryFraction to 0.4.
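For reference, the settings above roughly correspond to entries like the following in spark-defaults.conf. This is a sketch only: the exact property names and defaults vary by Spark version (the names below are the Spark 1.x ones), and the values are the ones described in the list.

```properties
# Let Netty use JVM heap buffers instead of direct (off-heap) buffers for shuffle I/O
spark.shuffle.io.preferDirectBufs    false
# Raise the per-reduce-task shuffle fetch buffer (default 48m)
spark.reducer.maxSizeInFlight        128m
# Use Kryo and enlarge its buffers
spark.serializer                     org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max      512m
# Give shuffle a larger share of executor memory (pre-unified-memory setting)
spark.shuffle.memoryFraction         0.4
```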
But none of them works. The small job always triggers the same series of errors and maxes out its retries (up to 1000 times). How can I troubleshoot this in such a situation?
Many thanks.
Answer
Check your logs to see whether you get an error similar to this:
ERROR 2015-05-12 17:29:16,984 Logging.scala:75 - Lost executor 13 on node-xzy: remote Akka client disassociated
Every time you get this error, it is because you lost an executor. As to why you lost the executor, that is another story; again, check your logs for clues.
For one thing, YARN can kill your job if it thinks you are using "too much memory".
Check for something like this:
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl - Container [<edited>] is running beyond physical memory limits. Current usage: 18.0 GB of 18 GB physical memory used; 19.4 GB of 37.8 GB virtual memory used. Killing container.
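If that is what you see, the usual remedy (a sketch, assuming a YARN deployment and Spark 1.x property names) is to leave YARN more headroom above the executor heap, since the container limit must cover heap plus off-heap overhead:

```properties
# Reserve extra off-heap room per executor container (value in MB; tune per workload)
spark.yarn.executor.memoryOverhead   2048
# Or lower the executor heap so heap + overhead stays under the container limit
spark.executor.memory                16g
```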