Spark groupBy OutOfMemory woes

Problem Description

I'm doing a simple groupBy on a fairly small dataset (80 files in HDFS, a few gigs in total). I'm running Spark on 8 low-memory machines in a YARN cluster, i.e. something along the lines of:

spark-submit ... --master yarn-client --num-executors 8 --executor-memory 3000m --executor-cores 1

The dataset consists of strings of length 500-2000.

I'm trying to do a simple groupByKey (see below), but it fails with a java.lang.OutOfMemoryError: GC overhead limit exceeded exception.

val keyvals = sc.newAPIHadoopFile("hdfs://...")
  .map( someobj.produceKeyValTuple )
keyvals.groupByKey().count()

I can count the group sizes using reduceByKey without problems, reassuring myself that the problem isn't caused by a single excessively large group, nor by an excessive number of groups:

keyvals.map(s => (s._1, 1)).reduceByKey((a,b) => a+b).collect().foreach(println)
// produces:
//  (key1,139368)
//  (key2,35335)
//  (key3,392744)
//  ...
//  (key13,197941)
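
For completeness, a similar one-liner can estimate the total payload per key rather than just the count, which helps rule out a single group being larger than executor memory. This is an illustrative diagnostic sketch, not from the original post, and it assumes the tuple values are the raw 500-2000 character strings:

// Rough per-key payload estimate: sum of value string lengths (in characters).
// Long avoids overflow for very large groups.
keyvals.mapValues(_.length.toLong)
  .reduceByKey(_ + _)
  .collect()
  .foreach { case (key, chars) => println(s"$key -> ~$chars chars") }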

I've tried reformatting, reshuffling, and increasing the groupBy level of parallelism:

keyvals.groupByKey(24).count // fails
keyvals.groupByKey(3000).count // fails
keyvals.coalesce(24, true).groupByKey(24).count // fails
keyvals.coalesce(3000, true).groupByKey(3000).count // fails
keyvals.coalesce(24, false).groupByKey(24).count // fails
keyvals.coalesce(3000, false).groupByKey(3000).count // fails

I've tried playing around with spark.default.parallelism, and increasing spark.shuffle.memoryFraction to 0.8 while lowering spark.storage.memoryFraction to 0.1.
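
For reference, here is a minimal sketch of how those settings could be supplied programmatically; the configuration keys are the Spark 1.x names mentioned above, while the application name is made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

// Raise the shuffle fraction and shrink the storage fraction, as described above.
val conf = new SparkConf()
  .setAppName("groupByKey-oom")                 // hypothetical app name
  .set("spark.default.parallelism", "3000")
  .set("spark.shuffle.memoryFraction", "0.8")
  .set("spark.storage.memoryFraction", "0.1")
val sc = new SparkContext(conf)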

The failing stage (count) will fail on task 2999 of 3000.

I can't seem to find anything that suggests that groupBy shouldn't just spill to disk instead of keeping things in memory, but I just can't get it to work right, even on fairly small datasets. This should obviously not be the case, and I must be doing something wrong, but I have no idea where to start debugging this!

Answer

Patrick Wendell shed some light on the details of the groupBy operator on the mailing list (http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-td11427.html#a11487). The takeaway message is the following:

Within a partition things will spill [...] This spilling can only occur across keys at the moment. Spilling cannot occur within a key at present. [...] Spilling within one key for GroupBy's is likely to end up in the next release of Spark, Spark 1.2. [...] If the goal is literally to just write out to disk all the values associated with each group, and the values associated with a single group are larger than fit in memory, this cannot be accomplished right now with the groupBy operator.

He further suggests a workaround:

The best way to work around this depends a bit on what you are trying to do with the data downstream. Typical approaches involve sub-dividing any very large groups, for instance, appending a hashed value in a small range (1-10) to large keys. Then your downstream code has to deal with aggregating partial values for each group. If your goal is just to lay each group out sequentially on disk in one big file, you can call sortByKey with a hashed suffix as well. The sort functions are externalized in Spark 1.1 (which is in pre-release).
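
Below is a minimal sketch of that key-salting idea, assuming the keys and values are strings and the goal is simply to group the raw values; the salt range of 10, the variable names, and the output path are illustrative rather than taken from the original answer:

// Append a salt in a small range to each key, so one huge logical group
// becomes several smaller physical groups that each fit in memory.
val numSalts = 10
val salted = keyvals.map { case (k, v) =>
  val salt = (v.hashCode & Int.MaxValue) % numSalts  // deterministic hash-based salt
  ((k, salt), v)
}

// Downstream code must then merge the partial groups that share the same original key.
val partialGroups = salted.groupByKey(3000)

// Alternative from the answer: lay each group out contiguously by sorting on the
// salted key; sorting spills to disk (is externalized) as of Spark 1.1.
salted.sortByKey().saveAsTextFile("hdfs://.../grouped-by-salted-key")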
