Ever increasing physical memory for a Spark Application in YARN


Problem Description

I am running a Spark application in YARN with 2 executors, each with Xms/Xmx set to 32 GB and spark.yarn.executor.memoryOverhead set to 6 GB.
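
For reference, a minimal sketch of the configuration described above, using the Spark 1.x SparkConf API (the application name is a placeholder, and the overhead is given in megabytes, as Spark 1.x expects):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the memory settings described above (values taken from the question).
// YARN sizes the container as executor memory + overhead = 32 GB + 6 GB = 38 GB,
// which matches the limit in the NodeManager log below.
val conf = new SparkConf()
  .setAppName("memory-demo")                          // placeholder name
  .set("spark.executor.memory", "32g")                // becomes -Xms32768m -Xmx32768m on the executor JVM
  .set("spark.yarn.executor.memoryOverhead", "6144")  // off-heap allowance, in MB for Spark 1.x
val sc = new SparkContext(conf)
```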

I am seeing that the app's physical memory keeps increasing until the container is finally killed by the NodeManager:

2015-07-25 15:07:05,354 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=10508,containerID=container_1437828324746_0002_01_000003] is running beyond physical memory limits. Current usage: 38.0 GB of 38 GB physical memory used; 39.5 GB of 152 GB virtual memory used. Killing container.
Dump of the process-tree for container_1437828324746_0002_01_000003 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 10508 9563 10508 10508 (bash) 0 0 9433088 314 /bin/bash -c /usr/java/default/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms32768m -Xmx32768m  -Dlog4j.configuration=log4j-executor.properties -XX:MetaspaceSize=512m -XX:+UseG1GC -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:gc.log -XX:AdaptiveSizePolicyOutputInterval=1  -XX:+UseGCLogFileRotation -XX:GCLogFileSize=500M -XX:NumberOfGCLogFiles=1 -XX:MaxDirectMemorySize=3500M -XX:NewRatio=3 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=36082 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -XX:NativeMemoryTracking=detail -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=512m -XX:CompressedClassSpaceSize=256m -Djava.io.tmpdir=/data/yarn/datanode/nm-local-dir/usercache/admin/appcache/application_1437828324746_0002/container_1437828324746_0002_01_000003/tmp '-Dspark.driver.port=43354' -Dspark.yarn.app.container.log.dir=/opt/hadoop/logs/userlogs/application_1437828324746_0002/container_1437828324746_0002_01_000003 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://sparkDriver@nn1:43354/user/CoarseGrainedScheduler 1 dn3 6 application_1437828324746_0002 1> /opt/hadoop/logs/userlogs/application_1437828324746_0002/container_1437828324746_0002_01_000003/stdout 2> /opt/hadoop/logs/userlogs/application_1437828324746_0002/container_1437828324746_0002_01_000003/stderr

I disabled YARN's "yarn.nodemanager.pmem-check-enabled" parameter and noticed that physical memory usage climbed to 40 GB.

I checked the total RSS in /proc/<pid>/smaps, and it matched the physical memory reported by YARN and shown by the top command.
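
As a diagnostic sketch (not part of the application), the per-mapping Rss fields in /proc/<pid>/smaps can be summed like this; each Rss line is reported in kB:

```scala
import scala.io.Source

// Sum the Rss fields of every mapping in /proc/<pid>/smaps for a given PID.
object SmapsRss {
  def main(args: Array[String]): Unit = {
    val pid = args(0)                               // executor PID, e.g. 10508
    val source = Source.fromFile(s"/proc/$pid/smaps")
    val rssKb = source.getLines()
      .filter(_.startsWith("Rss:"))                 // lines look like "Rss:   4 kB"
      .map(_.split("\\s+")(1).toLong)
      .sum
    source.close()
    println(s"Total RSS: ${rssKb / 1024} MB")
  }
}
```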

I checked that it's not a problem with the heap; something is growing in off-heap/native memory. I used tools like VisualVM but didn't find anything increasing there. MaxDirectMemory also didn't exceed 600 MB, the peak number of active threads was 70-80, thread stack size didn't exceed 100 MB, and metaspace was around 60-70 MB.

FYI, I am on Spark 1.2 and Hadoop 2.4.0. My Spark application is based on Spark SQL, it is HDFS read/write intensive, and it caches data using Spark SQL's in-memory caching.
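
A hypothetical sketch of that kind of workload, using the Spark 1.2-era API (paths and table names are placeholders, not from the original application):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Parquet-heavy HDFS reads/writes plus Spark SQL's in-memory caching.
val sc = new SparkContext(new SparkConf().setAppName("parquet-io-sketch"))
val sqlContext = new SQLContext(sc)

val events = sqlContext.parquetFile("hdfs:///data/events")   // read Parquet from HDFS
events.registerTempTable("events")
sqlContext.cacheTable("events")                              // columnar in-memory cache

val daily = sqlContext.sql("SELECT * FROM events WHERE day = '2015-07-25'")
daily.saveAsParquetFile("hdfs:///data/events_daily")         // write Parquet back to HDFS
```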

Any help would be highly appreciated, as would any hint on where to look to debug the memory leak, or any existing tool for doing so. Let me know if any other information is needed.

Recommended Answer

Finally, I was able to get rid of the issue. The problem was that the compressors created in Spark SQL's Parquet write path weren't getting recycled, so my executors were creating a brand-new compressor (backed by native memory) for every Parquet file written, which eventually exhausted the container's physical memory limit.
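
To illustrate the failure mode, here is a minimal analogue using java.util.zip.Deflater, not the actual Spark/Parquet code: a compressor that holds native state leaks off-heap memory if a fresh instance is created per file and never released or pooled, so RSS grows while the Java heap stays flat.

```scala
import java.util.zip.Deflater

// Analogue only: Deflater allocates native (off-heap) zlib state.
def writeOneFileLeaky(data: Array[Byte]): Array[Byte] = {
  val compressor = new Deflater()          // fresh native allocation per call
  compressor.setInput(data)
  compressor.finish()
  val out = new Array[Byte](data.length + 64)
  val n = compressor.deflate(out)
  // BUG: compressor.end() is never called and the instance is not reused,
  // so the native allocation is only reclaimed if/when the GC finalizes it.
  out.take(n)
}
```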

I have opened the following bug in the Parquet JIRA and raised a PR for it:

https://issues.apache.org/jira/browse/PARQUET-353

This fixed the memory issue at my end.

P.S. - You will see this problem only in a Parquet-write-intensive application.
