Spark OutOfMemoryError when taking a big input file
Problem description
I have a Spark application that reads a file with 100 million lines (each line contains a code, such as US1.234.567B1) and extracts some patterns from it, as follows:
import org.apache.spark.rdd.RDD

val codes = sc.textFile("/data/codes.txt")

// Replace every digit with 'd', e.g. "US1.234.567B1" -> "USd.ddd.dddBd"
def getPattern(code: String) = code.replaceAll("\\d", "d")

val patterns: RDD[(String, Int)] = codes
  .groupBy(getPattern)
  .mapValues(_.size)
  .sortBy(- _._2)

patterns
  .map { case (pattern, size) => s"$size\t$pattern" }
  .saveAsTextFile("/tmp/patterns")
I am running this with master=local[*], and it fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.
Why is that?
I thought that Spark can handle any size of input, as long as it has enough hard disk space.
Recommended answer
Long story short, you're trying to use a Spark anti-pattern:
  .groupBy(getPattern)
  .mapValues(_.size)
which can easily be expressed, for example, as:
codes.keyBy(getPattern).mapValues(_ => 1L).reduceByKey(_ + _).sortBy(_._2, false)
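For context, here is a minimal end-to-end sketch of the rewritten job; the wrapper object and SparkContext setup are my additions to make it runnable, while the paths and getPattern come from the question:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object PatternCounts {
  // Replace every digit with 'd', e.g. "US1.234.567B1" -> "USd.ddd.dddBd"
  def getPattern(code: String): String = code.replaceAll("\\d", "d")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pattern-counts").setMaster("local[*]"))

    // One (pattern, 1L) pair per input line; reduceByKey combines the counts
    // map-side, so no key ever has to buffer its full list of codes.
    val patterns: RDD[(String, Long)] = sc.textFile("/data/codes.txt")
      .map(code => (getPattern(code), 1L))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)

    patterns
      .map { case (pattern, count) => s"$count\t$pattern" }
      .saveAsTextFile("/tmp/patterns")

    sc.stop()
  }
}

The map + reduceByKey pair is equivalent to the keyBy/mapValues one-liner above; reduceByKey is the key change, because it aggregates before the shuffle.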
I thought that Spark can handle any size of input.
It usually can scale out, as long as you don't make that impossible. group / groupByKey on RDDs creates a local collection for each key, and each of those collections has to fit in the memory of a single executor.
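To make that concrete, here is a toy, runnable comparison of the two intermediate shapes (the object name and sample data are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object GroupVsReduceDemo {
  def getPattern(code: String): String = code.replaceAll("\\d", "d")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))
    val codes = sc.parallelize(Seq("US1.234.567B1", "US2.345.678B1", "EP0.123.456A1"))

    // groupBy yields RDD[(String, Iterable[String])]: every code that shares a
    // pattern is buffered in one collection, so a hot key can exhaust the heap.
    codes.groupBy(getPattern).mapValues(_.size).collect().foreach(println)

    // reduceByKey yields RDD[(String, Long)]: counts are combined map-side, so
    // memory grows with the number of distinct patterns, not with input lines.
    codes.map(c => (getPattern(c), 1L)).reduceByKey(_ + _).collect().foreach(println)

    sc.stop()
  }
}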