Spark OutOfMemoryError when taking a big input file
Problem description
I have a Spark application that reads a file with 100 million lines (each line contains a code, such as US1.234.567B1) and extracts some patterns from it, as follows:
import org.apache.spark.rdd.RDD

val codes = sc.textFile("/data/codes.txt")

// Replace every digit with 'd', e.g. "US1.234.567B1" -> "USd.ddd.dddBd"
def getPattern(code: String) = code.replaceAll("\\d", "d")

val patterns: RDD[(String, Int)] = codes
  .groupBy(getPattern)
  .mapValues(_.size)
  .sortBy(- _._2)

patterns
  .map { case (pattern, size) => s"$size\t$pattern" }
  .saveAsTextFile("/tmp/patterns")
I am running this with master=local[*], and it fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.
Why is that?
I thought that Spark can handle any size of input, as long as it has enough hard disk space.
Recommended answer
Long story short, you're trying to use a Spark anti-pattern:
  .groupBy(getPattern)
  .mapValues(_.size)
which can easily be expressed, for example, as:
codes.keyBy(getPattern).mapValues(_ => 1L).reduceByKey(_ + _).sortBy(_._2, false)
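For context, here is a minimal end-to-end sketch of the rewritten job; the wrapper object and SparkContext setup are my additions to make it runnable, while the paths and getPattern come from the question:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object PatternCounts {
  // Replace every digit with 'd', e.g. "US1.234.567B1" -> "USd.ddd.dddBd"
  def getPattern(code: String): String = code.replaceAll("\\d", "d")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pattern-counts").setMaster("local[*]"))

    // One (pattern, 1L) pair per input line; reduceByKey combines the counts
    // map-side, so no key ever has to buffer its full list of codes.
    val patterns: RDD[(String, Long)] = sc.textFile("/data/codes.txt")
      .map(code => (getPattern(code), 1L))
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)

    patterns
      .map { case (pattern, count) => s"$count\t$pattern" }
      .saveAsTextFile("/tmp/patterns")

    sc.stop()
  }
}

The map + reduceByKey pair is equivalent to the keyBy/mapValues one-liner above; reduceByKey is the key change, because it aggregates before the shuffle.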
I thought that Spark can handle any size of input.
It usually can scale out, as long as you don't make that impossible. group / groupByKey on RDDs creates a local collection for each key, and each of those collections has to fit in the memory of a single executor.
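To make that concrete, here is a toy, runnable comparison of the two intermediate shapes (the object name and sample data are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

object GroupVsReduceDemo {
  def getPattern(code: String): String = code.replaceAll("\\d", "d")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[*]"))
    val codes = sc.parallelize(Seq("US1.234.567B1", "US2.345.678B1", "EP0.123.456A1"))

    // groupBy yields RDD[(String, Iterable[String])]: every code that shares a
    // pattern is buffered in one collection, so a hot key can exhaust the heap.
    codes.groupBy(getPattern).mapValues(_.size).collect().foreach(println)

    // reduceByKey yields RDD[(String, Long)]: counts are combined map-side, so
    // memory grows with the number of distinct patterns, not with input lines.
    codes.map(c => (getPattern(c), 1L)).reduceByKey(_ + _).collect().foreach(println)

    sc.stop()
  }
}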