FetchFailedException or MetadataFetchFailedException when processing big data set


Problem description

When I run the parsing code with a 1 GB dataset it completes without any error. But when I attempt 25 GB of data at a time I get the errors below. I'm trying to understand how I can avoid these failures. Happy to hear any suggestions or ideas.

Different errors:

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0

org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-xxxxxxxx

org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/mnt/yarn/nm/usercache/xxxx/appcache/application_1450751731124_8446/blockmgr-8a7b17b8-f4c3-45e7-aea8-8b0a7481be55/08/shuffle_0_224_0.data, offset=12329181, length=2104094}

Cluster details:

Yarn: 8 nodes
Total cores: 64
Memory: 500 GB
Spark version: 1.5

spark-submit command:

spark-submit --master yarn-cluster \
                        --conf spark.dynamicAllocation.enabled=true \
                        --conf spark.shuffle.service.enabled=true \
                        --executor-memory 4g \
                        --driver-memory 16g \
                        --num-executors 50 \
                        --deploy-mode cluster \
                        --executor-cores 1 \
                        --class my.parser \
                        myparser.jar \
                        -input xxx \
                        -output xxxx

One of the stack traces:

at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:460)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:456)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:456)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:183)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:47)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Recommended answer

This error is almost guaranteed to be caused by memory issues on your executors. I can think of a couple of ways to address these types of problems.

1) You could try to run with more partitions (do a repartition on your dataframe). Memory issues typically arise when one or more partitions contain more data than will fit in memory. A minimal sketch of this is shown after this list.

2) I notice that you have not explicitly set spark.yarn.executor.memoryOverhead, so it will default to max(384, 0.10 * executorMemory), which in your case is roughly 410 MB. That sounds low to me. I would try to increase it to, say, 1 GB (note that if you increase memoryOverhead to 1 GB, you need to lower --executor-memory to 3 GB so the total container request does not grow). An adjusted submit command is shown after this list.

3) Look in the log files on the failing nodes. You want to look for the text "Killing container". If you see the text "running beyond physical memory limits", then in my experience increasing memoryOverhead will solve the problem.
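
For point 1, here is a minimal sketch of the idea in Scala. It is an illustration only: the input and output paths, the way the data is loaded, and the partition count of 2000 are placeholders, not details taken from the original parser job.

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    // Master and resource settings come from spark-submit, as in the question.
    val sc = new SparkContext(new SparkConf().setAppName("parser-sketch"))

    // Spread the input over more, smaller partitions before any shuffle-heavy
    // step (groupBy, join, reduceByKey, ...), so that no single partition has
    // to hold more data than an executor can keep in memory during the shuffle.
    val records = sc.textFile("<input path>").repartition(2000)

    // The same idea applies to a DataFrame: df.repartition(2000).
    // For DataFrame shuffles, spark.sql.shuffle.partitions (200 by default)
    // controls the number of post-shuffle partitions.

    records.saveAsTextFile("<output path>")
    sc.stop()
  }
}

The exact partition count matters less than keeping each partition well below what a single task can hold; with 4 GB executors running one task at a time, a few hundred MB per partition is a reasonable ceiling.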
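
For point 2, the submit command from the question could be adjusted along these lines. The only changes are the added memoryOverhead setting and the reduced executor heap, using the 1 GB / 3 GB values suggested above:

spark-submit --master yarn-cluster \
             --conf spark.dynamicAllocation.enabled=true \
             --conf spark.shuffle.service.enabled=true \
             --conf spark.yarn.executor.memoryOverhead=1024 \
             --executor-memory 3g \
             --driver-memory 16g \
             --num-executors 50 \
             --deploy-mode cluster \
             --executor-cores 1 \
             --class my.parser \
             myparser.jar \
             -input xxx \
             -output xxxx

Lowering the heap to 3 GB keeps the total memory requested per YARN container (heap plus overhead) from growing.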
