How to avoid Spark executor from getting lost and yarn container killing it due to memory limit?

Problem description

I have the following code, which fires hiveContext.sql() most of the time. My task is to create a few tables and insert values into them, after processing, for every Hive table partition.

So I first fire show partitions and, using its output in a for-loop, call a few methods that create the tables (if they don't exist) and insert into them using hiveContext.sql.

Now, we can't execute hiveContext inside an executor, so I have to run this in a for-loop in the driver program, serially, one partition at a time. When I submit this Spark job on a YARN cluster, almost every time my executors get lost because of a shuffle not found exception.

This is happening because YARN kills my executors due to memory overload. I don't understand why, since each Hive partition holds a very small data set, but it still causes YARN to kill my executors.

Will the following code do everything in parallel and try to hold all of the Hive partition data in memory at the same time?

public static void main(String[] args) throws IOException {
    SparkConf conf = new SparkConf();
    SparkContext sc = new SparkContext(conf);
    HiveContext hc = new HiveContext(sc);
    FileSystem fs = FileSystem.get(sc.hadoopConfiguration());

    // List the partitions of the source table; this runs on the driver.
    DataFrame partitionFrame = hc.sql("show partitions dbdata partition(date='2015-08-05')");

    // The partition list is small, so collecting it to the driver is cheap.
    Row[] rowArr = partitionFrame.collect();
    for (Row row : rowArr) {
        // Each partition row looks like "server=<server>/date=<date>".
        String[] splitArr = row.getString(0).split("/");
        String server = splitArr[0].split("=")[1];
        String date = splitArr[1].split("=")[1];
        String csvPath = "hdfs:///user/db/ext/" + server + ".csv";
        if (fs.exists(new Path(csvPath))) {
            hc.sql("ADD FILE " + csvPath);
        }
        // Each helper fires CREATE TABLE IF NOT EXISTS / INSERT statements via hc.sql(),
        // serially, one partition at a time.
        createInsertIntoTableABC(hc, server, date);
        createInsertIntoTableDEF(hc, server, date);
        createInsertIntoTableGHI(hc, server, date);
        createInsertIntoTableJKL(hc, server, date);
        createInsertIntoTableMNO(hc, server, date);
    }
}

Recommended answer

Generally, you should always dig into the logs to get the real exception out (at least in Spark 1.3.1).
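
For example (an illustration, not part of the original answer): with YARN log aggregation enabled, the container logs of a finished application can be pulled with the YARN CLI and searched for the real error; the application id is a placeholder:

yarn logs -applicationId <applicationId>

Messages such as "Container ... is running beyond physical memory limits" in those logs point directly at the memory-overhead problem discussed below.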

tl;dr: a safe config for Spark under YARN:

spark.shuffle.memoryFraction=0.5 - this allows the shuffle to use more of the allocated memory
spark.yarn.executor.memoryOverhead=1024 - this is set in MB. YARN kills an executor when its memory usage is larger than (executor-memory + executor.memoryOverhead)
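
As an illustration only (the executor memory size, class name, and jar name below are placeholders, not values from the original answer), both properties can be passed with --conf when submitting the job to YARN:

spark-submit \
  --master yarn-cluster \
  --executor-memory 2g \
  --conf spark.shuffle.memoryFraction=0.5 \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --class com.example.HivePartitionJob \
  my-job.jar

The same two properties can also be set on the SparkConf in the driver, as sketched at the end of the answer.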

More information

From reading your question, you mention that you get a shuffle not found exception.

In the case of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle, you should increase spark.shuffle.memoryFraction, for example to 0.5.

The most common reason for YARN killing off my executors was memory usage beyond what it expected. To avoid that, increase spark.yarn.executor.memoryOverhead; I've set it to 1024, even though my executors use only 2-3 GB of memory.
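
For completeness, here is a minimal sketch (the app name and the 2g executor memory are assumptions, not values from the original post) of setting these properties programmatically in a Java driver like the one in the question, before the HiveContext is created:

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class ConfiguredDriver {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("hive-partition-job")                   // hypothetical app name
                .set("spark.executor.memory", "2g")                 // assumed executor heap size
                .set("spark.shuffle.memoryFraction", "0.5")         // let the shuffle use more of the heap (Spark 1.x)
                .set("spark.yarn.executor.memoryOverhead", "1024"); // extra off-heap headroom for YARN, in MB
        // YARN sizes each executor container at roughly executor-memory + memoryOverhead,
        // i.e. about 2 GB + 1 GB = 3 GB per executor here; an executor that grows past that is killed.
        SparkContext sc = new SparkContext(conf);
        HiveContext hc = new HiveContext(sc);
        // ... the per-partition loop from the question would go here ...
    }
}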
