"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

Question

I'm running a 5-node Spark cluster on AWS EMR, each node an m3.xlarge (1 master, 4 slaves). I successfully ran through a 146MB bzip2-compressed CSV file and ended up with a perfectly aggregated result.

Now I'm trying to process a ~5GB bzip2 CSV file on this cluster but I'm receiving this error:

16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

I'm confused as to why I'm getting a ~10.5GB memory limit on a ~75GB cluster (15GB per m3.xlarge instance)...

Here is my EMR configuration:

[
  {
    "classification":"spark-env",
    "properties":{

    },
    "configurations":[
      {
        "classification":"export",
        "properties":{
          "PYSPARK_PYTHON":"python34"
        },
        "configurations":[

        ]
      }
    ]
  },
  {
    "classification":"spark",
    "properties":{
      "maximizeResourceAllocation":"true"
    },
    "configurations":[

    ]
  }
]

From what I've read, setting the maximizeResourceAllocation property should tell EMR to configure Spark to fully utilize all resources available on the cluster. I.e., I should have ~75GB of memory available... So why am I getting a ~10.5GB memory limit error? Here is the code I'm running:

import statistics
import sys

import arrow  # third-party; used below to convert timestamp strings to Unix timestamps
import pyspark
import pyspark.sql
import pyspark.sql.functions

# SESSION_TIMEOUT and save_session are defined elsewhere in the script (not shown here).


def sessionize(raw_data, timeout):
    # https://www.dataiku.com/learn/guide/code/reshaping_data/sessionization.html
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp"))
    diff = (pyspark.sql.functions.lag(raw_data.timestamp, 1)
            .over(window))
    time_diff = (raw_data.withColumn("time_diff", raw_data.timestamp - diff)
                 .withColumn("new_session", pyspark.sql.functions.when(pyspark.sql.functions.col("time_diff") >= timeout.seconds, 1).otherwise(0)))
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp")
              .rowsBetween(-1, 0))
    sessions = (time_diff.withColumn("session_id", pyspark.sql.functions.concat_ws("_", "user_id", "site_id", pyspark.sql.functions.sum("new_session").over(window))))
    return sessions


def aggregate_sessions(sessions):
    median = pyspark.sql.functions.udf(lambda x: statistics.median(x))
    aggregated = sessions.groupBy(pyspark.sql.functions.col("session_id")).agg(
        pyspark.sql.functions.first("site_id").alias("site_id"),
        pyspark.sql.functions.first("user_id").alias("user_id"),
        pyspark.sql.functions.count("id").alias("hits"),
        pyspark.sql.functions.min("timestamp").alias("start"),
        pyspark.sql.functions.max("timestamp").alias("finish"),
        median(pyspark.sql.functions.collect_list("foo")).alias("foo"),
    )
    return aggregated


spark_context = pyspark.SparkContext(appName="process-raw-data")
spark_session = pyspark.sql.SparkSession(spark_context)
raw_data = spark_session.read.csv(sys.argv[1],
                                  header=True,
                                  inferSchema=True)
# Windowing doesn't seem to play nicely with TimestampTypes.
#
# Should be able to do this within the ``spark.read.csv`` call, I'd
# think. Need to look into it.
convert_to_unix = pyspark.sql.functions.udf(lambda s: arrow.get(s).timestamp)
raw_data = raw_data.withColumn("timestamp",
                               convert_to_unix(pyspark.sql.functions.col("timestamp")))
sessions = sessionize(raw_data, SESSION_TIMEOUT)
aggregated = aggregate_sessions(sessions)
aggregated.foreach(save_session)

Basically, nothing more than windowing and a groupBy to aggregate the data.

It starts with a few of these errors, and the same error keeps accumulating until the job eventually grinds to a halt.

I've tried running spark-submit with --conf spark.yarn.executor.memoryOverhead but that doesn't seem to solve the problem either.
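
For reference, the same setting can also be baked into the cluster at creation time through EMR's spark-defaults configuration classification instead of being passed on the spark-submit command line. This is only a sketch; the 3072 (MB) value is an arbitrary illustration, not a figure taken from this question or the answer below:

{
  "classification":"spark-defaults",
  "properties":{
    "spark.yarn.executor.memoryOverhead":"3072"
  }
}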

Answer

I feel your pain.

We had similar issues of running out of memory with Spark on YARN. We have five 64GB, 16-core VMs, and regardless of what we set spark.yarn.executor.memoryOverhead to, we just couldn't get enough memory for these tasks -- they would eventually die no matter how much memory we gave them. And this was a relatively straightforward Spark application that was causing this to happen.

We figured out that the physical memory usage was quite low on the VMs but the virtual memory usage was extremely high (despite the logs complaining about physical memory). We set yarn.nodemanager.vmem-check-enabled in yarn-site.xml to false and our containers were no longer killed, and the application appeared to work as expected.
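
On EMR specifically, rather than editing yarn-site.xml on each node by hand, the same change should be achievable through a yarn-site configuration classification, added as another entry alongside the spark entries shown in the question. A minimal sketch, assuming the standard yarn-site classification:

{
  "classification":"yarn-site",
  "properties":{
    "yarn.nodemanager.vmem-check-enabled":"false"
  }
}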

Doing more research, I found the answer to why this happens here: https://www.mapr.com/blog/best-practices-yarn-resource-management

Since on CentOS/RHEL 6 there is aggressive allocation of virtual memory due to OS behavior, you should disable the virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.
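
If you would rather leave the check enabled, the other option from that quote is to raise the ratio instead. A hypothetical yarn-site entry might look like the following; the value 5 is an arbitrary example, not a tuned recommendation (YARN's default is 2.1):

{
  "classification":"yarn-site",
  "properties":{
    "yarn.nodemanager.vmem-pmem-ratio":"5"
  }
}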

That page had a link to a very useful page from IBM: https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en

In summary, glibc > 2.10 changed its memory allocation. And although huge amounts of virtual memory being allocated isn't the end of the world, it doesn't work with the default settings of YARN.

Instead of setting yarn.nodemanager.vmem-check-enabled to false, you could also play with setting the MALLOC_ARENA_MAX environment variable to a low number in hadoop-env.sh. This bug report has helpful information about that: https://issues.apache.org/jira/browse/HADOOP-7154
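
On EMR, that environment variable could presumably be exported through the hadoop-env classification, mirroring the spark-env/export pattern from the question's configuration; the value 4 below is just a commonly cited low setting, not something prescribed by the answer:

{
  "classification":"hadoop-env",
  "properties":{

  },
  "configurations":[
    {
      "classification":"export",
      "properties":{
        "MALLOC_ARENA_MAX":"4"
      }
    }
  ]
}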

I recommend reading through both pages -- the information is very handy.
