Spark OutOfMemoryError when taking a big input file

Problem description

I have a Spark application that reads a file with 100 million lines (each line holds a code such as US1.234.567B1) and extracts some patterns from it, as follows:

  import org.apache.spark.rdd.RDD

  val codes = sc.textFile("/data/codes.txt")

  def getPattern(code: String) = code.replaceAll("\\d", "d")

  val patterns: RDD[(String, Int)] = codes
    .groupBy(getPattern)
    .mapValues(_.size)
    .sortBy(- _._2)

  patterns
    .map { case (pattern, size) => s"$size\t$pattern" }
    .saveAsTextFile("/tmp/patterns")

I am running this on master=local[*], and it fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.

Why is that?

I thought that Spark could handle any size of input, as long as it has enough hard disk space.

Recommended answer

Long story short, you're trying to use a Spark anti-pattern:

.groupBy(getPattern)
.mapValues(_.size)

That can easily be expressed, for example, as:

codes.keyBy(getPattern).mapValues(_ => 1L).reduceByKey(_ + _).sortBy(_._2, false)
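
A minimal sketch of how that fix slots into the original job (assuming the same SparkContext, getPattern, and paths as in the question):

  import org.apache.spark.rdd.RDD

  val codes = sc.textFile("/data/codes.txt")

  def getPattern(code: String) = code.replaceAll("\\d", "d")

  // Count per pattern with a map-side combine instead of materializing each group.
  val patterns: RDD[(String, Long)] = codes
    .keyBy(getPattern)                 // (pattern, code)
    .mapValues(_ => 1L)                // (pattern, 1)
    .reduceByKey(_ + _)                // (pattern, count), combined per partition first
    .sortBy(_._2, ascending = false)

  patterns
    .map { case (pattern, size) => s"$size\t$pattern" }
    .saveAsTextFile("/tmp/patterns")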

I thought that Spark can handle any size of input.

It usually can scale out, as long as you don't make that impossible. group / groupByKey on RDDs creates a local collection for each key, and each of these collections has to fit in the memory of a single executor.
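
To make that concrete, a side-by-side sketch of the two approaches (same codes and getPattern as above; the comments describe general RDD behaviour rather than measurements from this job):

  // groupBy buffers every code that maps to the same pattern (e.g. "USd.ddd.dddBd")
  // into one Iterable before .size is called; with 100 million lines and only a
  // handful of distinct patterns, a single key's buffer can exceed the heap.
  val viaGroup = codes.groupBy(getPattern).mapValues(_.size)

  // reduceByKey combines locally within each partition into (pattern, partialCount),
  // so only one counter per distinct pattern per partition is held in memory and
  // shuffled, no matter how many lines share that pattern.
  val viaReduce = codes.map(code => (getPattern(code), 1L)).reduceByKey(_ + _)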
