Why does Spark fail with java.lang.OutOfMemoryError: GC overhead limit exceeded?

Question

I'm trying to implement in Spark a Hadoop Map/Reduce job that worked fine before. The Spark application is defined as follows:

val data = spark.textFile(file, 2).cache()
val result = data
  .map(/* some pre-processing */)
  .map(docWeightPar => (docWeightPar(0), docWeightPar(1)))
  .flatMap(line => MyFunctions.combine(line))
  .reduceByKey(_ + _)

where MyFunctions.combine is:

def combine(tuples: Array[(String, String)]): IndexedSeq[(String,Double)] =
  for (i <- 0 to tuples.length - 2;
       j <- 1 to tuples.length - 1
  ) yield (toKey(tuples(i)._1,tuples(j)._1),tuples(i)._2.toDouble * tuples(j)._2.toDouble)

The combine function produces a lot of map keys when the input list is big, and this is where the exception is thrown.

In the Hadoop Map/Reduce setting I didn't have problems, because the point where the combine function yields its output is also the point where Hadoop writes the map pairs to disk. Spark seems to keep everything in memory until it blows up with java.lang.OutOfMemoryError: GC overhead limit exceeded.

I am probably doing something really basic wrong, but I couldn't find any pointers on how to move forward from this, and I would like to know how I can avoid it. Since I am a total noob at Scala and Spark, I am not sure whether the problem comes from one or the other, or both. I am currently trying to run this program on my own laptop, and it works for inputs where the tuples array is not very long. Thanks in advance.

Answer

Adjusting the memory is probably a good way to go, as has already been suggested, because this is an expensive operation that scales in an ugly way. But maybe some code changes will help.
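
For reference, the usual knobs are spark.executor.memory and spark.driver.memory. Below is a minimal sketch of setting them programmatically, assuming a plain SparkContext setup like the one in the question; the application name and the 4g values are placeholders, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder sizes; pick values that fit your laptop or cluster.
val conf = new SparkConf()
  .setAppName("combine-job")            // placeholder name
  .set("spark.executor.memory", "4g")   // heap available to each executor
  .set("spark.driver.memory", "4g")     // only takes effect if set before the driver JVM starts
val sc = new SparkContext(conf)

In practice the driver memory is usually passed on the command line instead, e.g. spark-submit --driver-memory 4g --executor-memory 4g ..., since the driver JVM is already running by the time a SparkConf set in code is read.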

You could take a different approach in your combine function that avoids if statements by using the combinations function. I'd also convert the second element of the tuples to doubles before the combination operation:

tuples
  // Convert to doubles only once
  .map { x => (x._1, x._2.toDouble) }
  // Take all pairwise combinations. Though this function
  // will not give self-pairs, which it looks like you might need
  .combinations(2)
  // Your operation
  .map { x => (toKey(x(0)._1, x(1)._1), x(0)._2 * x(1)._2) }

This will give an Iterator, which you can use downstream or, if you want, convert to a list (or something else) with toList.
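
Wired back into the original method, a rough sketch of the whole thing could look like the following; it assumes toKey is in scope and takes the same two String arguments as in the question, and it returns a List instead of the original IndexedSeq:

def combine(tuples: Array[(String, String)]): List[(String, Double)] =
  tuples
    .map { case (k, w) => (k, w.toDouble) }   // convert the weights to doubles once
    .combinations(2)                          // every unordered pair, with no self-pairs
    .map { pair => (toKey(pair(0)._1, pair(1)._1), pair(0)._2 * pair(1)._2) }
    .toList

Since RDD.flatMap accepts any TraversableOnce, you could also drop the toList and return the Iterator directly, which avoids materialising all the pairs at once inside the map task.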
