"Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" on an EMR cluster with 75GB of memory

Question

I'm running a 5-node Spark cluster on AWS EMR, each node an m3.xlarge (1 master, 4 slaves). I successfully ran through a 146MB bzip2-compressed CSV file and ended up with a perfectly aggregated result.

Now I'm trying to process a ~5GB bzip2 CSV file on this cluster but I'm receiving this error:

16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

I'm confused as to why I'm getting a ~10.5GB memory limit on a ~75GB cluster (15GB per m3.xlarge instance)...

Here is my EMR configuration:

[
  {
    "classification":"spark-env",
    "properties":{

    },
    "configurations":[
      {
        "classification":"export",
        "properties":{
          "PYSPARK_PYTHON":"python34"
        },
        "configurations":[

        ]
      }
    ]
  },
  {
    "classification":"spark",
    "properties":{
      "maximizeResourceAllocation":"true"
    },
    "configurations":[

    ]
  }
]

From what I've read, setting the maximizeResourceAllocation property should tell EMR to configure Spark to fully utilize all resources available on the cluster. I.e., I should have ~75GB of memory available... So why am I getting a ~10.5GB memory limit error? Here is the code I'm running:

import statistics
import sys

import arrow  # third-party; used below to convert timestamp strings to Unix timestamps
import pyspark
import pyspark.sql
import pyspark.sql.functions

# SESSION_TIMEOUT and save_session are defined elsewhere in the script (not shown here).


def sessionize(raw_data, timeout):
    # https://www.dataiku.com/learn/guide/code/reshaping_data/sessionization.html
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp"))
    diff = (pyspark.sql.functions.lag(raw_data.timestamp, 1)
            .over(window))
    time_diff = (raw_data.withColumn("time_diff", raw_data.timestamp - diff)
                 .withColumn("new_session", pyspark.sql.functions.when(pyspark.sql.functions.col("time_diff") >= timeout.seconds, 1).otherwise(0)))
    window = (pyspark.sql.Window.partitionBy("user_id", "site_id")
              .orderBy("timestamp")
              .rowsBetween(-1, 0))
    sessions = (time_diff.withColumn("session_id", pyspark.sql.functions.concat_ws("_", "user_id", "site_id", pyspark.sql.functions.sum("new_session").over(window))))
    return sessions


def aggregate_sessions(sessions):
    median = pyspark.sql.functions.udf(lambda x: statistics.median(x))
    aggregated = sessions.groupBy(pyspark.sql.functions.col("session_id")).agg(
        pyspark.sql.functions.first("site_id").alias("site_id"),
        pyspark.sql.functions.first("user_id").alias("user_id"),
        pyspark.sql.functions.count("id").alias("hits"),
        pyspark.sql.functions.min("timestamp").alias("start"),
        pyspark.sql.functions.max("timestamp").alias("finish"),
        median(pyspark.sql.functions.collect_list("foo")).alias("foo"),
    )
    return aggregated


spark_context = pyspark.SparkContext(appName="process-raw-data")
spark_session = pyspark.sql.SparkSession(spark_context)
raw_data = spark_session.read.csv(sys.argv[1],
                                  header=True,
                                  inferSchema=True)
# Windowing doesn't seem to play nicely with TimestampTypes.
#
# Should be able to do this within the ``spark.read.csv`` call, I'd
# think. Need to look into it.
convert_to_unix = pyspark.sql.functions.udf(lambda s: arrow.get(s).timestamp)
raw_data = raw_data.withColumn("timestamp",
                               convert_to_unix(pyspark.sql.functions.col("timestamp")))
sessions = sessionize(raw_data, SESSION_TIMEOUT)
aggregated = aggregate_sessions(sessions)
aggregated.foreach(save_session)

Basically, nothing more than windowing and a groupBy to aggregate the data.

It starts with a few of these errors, and the same error keeps accumulating until the job eventually grinds to a halt.

I've tried running spark-submit with --conf spark.yarn.executor.memoryOverhead but that doesn't seem to solve the problem either.
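
For reference, the same setting can also be baked into the cluster at creation time through EMR's spark-defaults configuration classification instead of being passed on the spark-submit command line. This is only a sketch; the 3072 (MB) value is an arbitrary illustration, not a figure taken from this question or the answer below:

{
  "classification":"spark-defaults",
  "properties":{
    "spark.yarn.executor.memoryOverhead":"3072"
  }
}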

Answer

I feel your pain.

We had similar issues of running out of memory with Spark on YARN. We have five 64GB, 16-core VMs, and regardless of what we set spark.yarn.executor.memoryOverhead to, we just couldn't get enough memory for these tasks -- they would eventually die no matter how much memory we gave them. And this was a relatively straightforward Spark application that was causing this to happen.

We figured out that the physical memory usage was quite low on the VMs but the virtual memory usage was extremely high (despite the logs complaining about physical memory). We set yarn.nodemanager.vmem-check-enabled in yarn-site.xml to false and our containers were no longer killed, and the application appeared to work as expected.
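
On EMR specifically, rather than editing yarn-site.xml on each node by hand, the same change should be achievable through a yarn-site configuration classification, added as another entry alongside the spark entries shown in the question. A minimal sketch, assuming the standard yarn-site classification:

{
  "classification":"yarn-site",
  "properties":{
    "yarn.nodemanager.vmem-check-enabled":"false"
  }
}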

Doing more research, I found the answer to why this happens here: https://www.mapr.com/blog/best-practices-yarn-resource-management

Since on CentOS/RHEL 6 there is aggressive allocation of virtual memory due to OS behavior, you should disable the virtual memory checker or increase yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.
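
If you would rather leave the check enabled, the other option from that quote is to raise the ratio instead. A hypothetical yarn-site entry might look like the following; the value 5 is an arbitrary example, not a tuned recommendation (YARN's default is 2.1):

{
  "classification":"yarn-site",
  "properties":{
    "yarn.nodemanager.vmem-pmem-ratio":"5"
  }
}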

That page had a link to a very useful page from IBM: https://www.ibm.com/developerworks/community/blogs/kevgrig/entry/linux_glibc_2_10_rhel_6_malloc_may_show_excessive_virtual_memory_usage?lang=en

In summary, glibc > 2.10 changed its memory allocation. And although huge amounts of virtual memory being allocated isn't the end of the world, it doesn't work with the default settings of YARN.

Instead of setting yarn.nodemanager.vmem-check-enabled to false, you could also play with setting the MALLOC_ARENA_MAX environment variable to a low number in hadoop-env.sh. This bug report has helpful information about that: https://issues.apache.org/jira/browse/HADOOP-7154
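
On EMR, that environment variable could presumably be exported through the hadoop-env classification, mirroring the spark-env/export pattern from the question's configuration; the value 4 below is just a commonly cited low setting, not something prescribed by the answer:

{
  "classification":"hadoop-env",
  "properties":{

  },
  "configurations":[
    {
      "classification":"export",
      "properties":{
        "MALLOC_ARENA_MAX":"4"
      }
    }
  ]
}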

I recommend reading through both pages -- the information is very handy.
