Spark java.lang.OutOfMemoryError: Java heap space

Problem description

My cluster: 1 master, 11 slaves, each node has 6 GB of memory.

My settings:

spark.executor.memory=4g, -Dspark.akka.frameSize=512
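
For reference, this is roughly how such settings are commonly applied through SparkConf; a minimal sketch only, assuming a pre-2.0 Spark (where spark.akka.frameSize still exists) and a hypothetical application name, since the question does not show how the values were actually passed:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ImageBundleJob")          // hypothetical application name
  .set("spark.executor.memory", "4g")    // heap per executor, as in the settings above
  .set("spark.akka.frameSize", "512")    // max Akka frame size in MB (pre-2.0 Spark only)
val sc = new SparkContext(conf)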

Here is the problem:

First, I read some data (2.19 GB) from HDFS into an RDD:

val imageBundleRDD = sc.newAPIHadoopFile(...)

Second, do something with this RDD:

val res = imageBundleRDD.map(data => {
  val desPoints = threeDReconstruction(data._2, bg)
  (data._1, desPoints)
})

Finally, output to HDFS:

res.saveAsNewAPIHadoopFile(...)

When I run my program it shows:

.....
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Starting task 1.0:24 as TID 33 on executor 9: Salve7.Hadoop (NODE_LOCAL)
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Serialized task 1.0:24 as 30618515 bytes in 210 ms
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Starting task 1.0:36 as TID 34 on executor 2: Salve11.Hadoop (NODE_LOCAL)
14/01/15 21:42:28 INFO cluster.ClusterTaskSetManager: Serialized task 1.0:36 as 30618515 bytes in 449 ms
14/01/15 21:42:28 INFO cluster.ClusterTaskSetManager: Starting task 1.0:32 as TID 35 on executor 7: Salve4.Hadoop (NODE_LOCAL)
Uncaught error from thread [spark-akka.actor.default-dispatcher-3] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[spark]
java.lang.OutOfMemoryError: Java heap space

Too many tasks?

PS: Everything is OK when the input data is about 225 MB.

How can I solve this problem?

Recommended answer

I have a few suggestions:

  • If your nodes are configured to have 6g maximum for Spark (and are leaving a little for other processes), then use 6g rather than 4g, spark.executor.memory=6g. Make sure you're using as much memory as possible by checking the UI (it will say how much mem you're using)
  • Try using more partitions, you should have 2 - 4 per CPU. IME increasing the number of partitions is often the easiest way to make a program more stable (and often faster). For huge amounts of data you may need way more than 4 per CPU, I've had to use 8000 partitions in some cases!
  • Decrease the fraction of memory reserved for caching, using spark.storage.memoryFraction. If you don't use cache() or persist() in your code, this might as well be 0. Its default is 0.6, which means you only get 0.4 * 4g of memory for your heap. IME reducing the mem frac often makes OOMs go away. UPDATE: From Spark 1.6, apparently we no longer need to play with these values; Spark will determine them automatically.
  • Similar to above, but for the shuffle memory fraction. If your job doesn't need much shuffle memory then set it to a lower value (this might cause your shuffles to spill to disk, which can have a catastrophic impact on speed). Sometimes when it's a shuffle operation that's OOMing you need to do the opposite, i.e. set it to something large, like 0.8, or make sure you allow your shuffles to spill to disk (the default since 1.0.0). The first sketch after this list pulls these configuration knobs together.
  • Watch out for memory leaks, these are often caused by accidentally closing over objects you don't need in your lambdas. The way to diagnose is to look out for the "task serialized as XXX bytes" in the logs, if XXX is larger than a few k or more than an MB, you may have a memory leak. See https://stackoverflow.com/a/25270600/1586965
  • Related to above; use broadcast variables if you really do need large objects.
  • If you are caching large RDDs and can sacrifice some access time, consider serialising the RDD http://spark.apache.org/docs/latest/tuning.html#serialized-rdd-storage. Or even caching them on disk (which sometimes isn't that bad if using SSDs). The second sketch after this list shows serialised caching together with a broadcast variable.
  • (Advanced) Related to above, avoid String and heavily nested structures (like Map and nested case classes). If possible try to only use primitive types and index all non-primitives especially if you expect a lot of duplicates. Choose WrappedArray over nested structures whenever possible. Or even roll out your own serialisation - YOU will have the most information regarding how to efficiently back your data into bytes, USE IT!
  • (bit hacky) Again when caching, consider using a Dataset to cache your structure, as it will use more efficient serialisation. This should be regarded as a hack when compared to the previous bullet point. Building your domain knowledge into your algo/serialisation can minimise memory/cache-space by 100x or 1000x, whereas all a Dataset will likely give is 2x - 5x in memory and 10x compressed (parquet) on disk. A minimal Dataset caching sketch follows at the end of this list.
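
To make the first few bullets concrete, here is a minimal sketch of those configuration knobs together, assuming a pre-1.6 Spark where spark.storage.memoryFraction and spark.shuffle.memoryFraction are still honoured; the memory values and the partition count are illustrative starting points, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; tune per job and per Spark version.
val conf = new SparkConf()
  .setAppName("TuningSketch")
  .set("spark.executor.memory", "6g")          // use the full 6 GB available per node
  .set("spark.storage.memoryFraction", "0.1")  // this job barely uses cache(), so shrink the cache share
  .set("spark.shuffle.memoryFraction", "0.2")  // raise this instead if a shuffle stage is what OOMs
  .set("spark.shuffle.spill", "true")          // let shuffles spill to disk rather than blow the heap
val sc = new SparkContext(conf)

// More partitions (2-4+ per CPU core) keep each task's working set small.
// With 11 workers, a few hundred partitions is a reasonable starting point.
val data = sc.parallelize(1 to 1000000)
val repartitioned = data.repartition(264)
println(repartitioned.partitions.length)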
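
A second sketch covers the broadcast-variable and serialised-caching bullets; the small lookup map is only a toy stand-in for a large read-only object (such as the bg used in the question's map):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("CachingSketch"))

// Broadcast a large read-only object once per executor instead of capturing it
// in every task closure (closure capture also inflates the "Serialized task ... bytes"
// numbers visible in the logs above).
val lookup = Map("a" -> 1, "b" -> 2)   // toy stand-in for a large object
val lookupBc = sc.broadcast(lookup)

val rdd = sc.parallelize(Seq("a", "b", "a"))
val mapped = rdd.map(k => lookupBc.value.getOrElse(k, 0))

// Serialised storage trades some CPU on (de)serialisation for a much smaller heap footprint.
mapped.persist(StorageLevel.MEMORY_ONLY_SER)
println(mapped.count())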
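
Finally, a minimal Dataset caching sketch for the last bullet; this assumes a Spark 2.x upgrade (the question predates Datasets) and a hypothetical Point record type:

import org.apache.spark.sql.SparkSession

// Hypothetical record type standing in for the (id, reconstructed points) data.
case class Point(id: String, x: Double, y: Double, z: Double)

val spark = SparkSession.builder().appName("DatasetCacheSketch").getOrCreate()
import spark.implicits._

val ds = Seq(Point("a", 1.0, 2.0, 3.0), Point("b", 4.0, 5.0, 6.0)).toDS()

// Datasets cache in a compact binary format via Encoders, typically several
// times smaller than caching the equivalent JVM objects.
ds.cache()
println(ds.count())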

http://spark.apache.org/docs/1.2.1/configuration.html

EDIT: (So I can google myself easier) The following is also indicative of this problem:

java.lang.OutOfMemoryError : GC overhead limit exceeded
