Scala Iterable Memory Leaks


Problem description

I recently started playing with Scala and ran across the following. Below are 4 different ways to iterate through the lines of a file, do some stuff, and write the result to another file. Some of these methods work as I would think (though using a lot of memory to do so) and some eat memory to no end.

The idea was to wrap Scala's getLines Iterator as an Iterable. I don't care if it reads the file multiple times - that's what I expect it to do.

Here's my repro code:

class FileIterable(file: java.io.File) extends Iterable[String] {
  override def iterator = io.Source.fromFile(file).getLines
}

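// Define exactly one of the four `lines` options below.
// `file` is assumed to be a java.io.File already in scope.

// Iterator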

// Option 1: Direct iterator - holds at 100MB
def lines = io.Source.fromFile(file).getLines

// Option 2: Get iterator via method - holds at 100MB
def lines = new FileIterable(file).iterator

// Iterable

// Option 3: TraversableOnce wrapper - holds at 2GB
def lines = io.Source.fromFile(file).getLines.toIterable

// Option 4: Iterable wrapper - leaks like a sieve
def lines = new FileIterable(file)

def values = lines
      .drop(1)
      //.map(l => l.split("\t")).map(l => l.reduceLeft(_ + "|" + _))
      //.filter(l => l.startsWith("*"))

val writer = new java.io.PrintWriter(new java.io.File("out.tsv"))
values.foreach(v => writer.println(v))
writer.close()

The file it's reading is ~10GB with 1MB lines.

The first two options iterate the file using a constant amount of memory (~100MB). This is what I would expect. The downside here is that an iterator can only be used once and it's using Scala's call-by-name convention as a pseudo-iterable. (For reference, the equivalent C# code uses ~14MB.)

The third method calls toIterable defined in TraversableOnce. This one works, but it uses about 2GB to do the same work. No idea where the memory is going because it can't cache the entire Iterable.

The fourth is the most alarming - it immediately uses all available memory and throws an OOM exception. Even weirder is that it does this for all of the operations I've tested: drop, map, and filter. Looking at the implementations, none of them seem to maintain much state (though the drop looks a little suspect - why does it not just count the items?). If I do no operations, it works fine.

My guess is that somewhere it's maintaining references to each of the lines read, though I can't imagine how. I've seen the same memory usage when passing Iterables around in Scala. For example if I take case 3 (.toIterable) and pass that to a method that writes an Iterable[String] to a file, I see the same explosion.

Any ideas?

Recommended answer

Note how the ScalaDoc of Iterable says:

Implementations of this trait need to provide a concrete method with signature:

  def iterator: Iterator[A]

They also need to provide a method newBuilder which creates a builder for collections of the same kind.

Since you don't provide an implementation for newBuilder, you get the default implementation, which uses a ListBuffer and thus tries to fit everything into memory.
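
To make the strict behaviour visible without a 10GB file, here is a minimal sketch. It assumes a hypothetical `CountingIterable` (not part of the question) that counts how many elements its iterator has handed out. Calling `drop` directly on the Iterable should pull every element right away, because the result is assembled through a builder, while `view.drop` should pull nothing until the result is traversed. The exact builder differs between Scala versions, so treat this as an illustration rather than a description of the library internals.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper: an Iterable that records how many elements
// its iterators have produced so far.
class CountingIterable(n: Int) extends Iterable[Int] {
  val produced = new AtomicInteger(0)
  override def iterator: Iterator[Int] =
    Iterator.range(0, n).map { i => produced.incrementAndGet(); i }
}

object StrictVsLazyDrop {
  // Returns (elements pulled by Iterable.drop,
  //          elements pulled by view.drop before any traversal).
  def pulledCounts(n: Int): (Int, Int) = {
    val strictSource = new CountingIterable(n)
    strictSource.drop(1) // builds a new collection eagerly
    val strictPulled = strictSource.produced.get

    val lazySource = new CountingIterable(n)
    lazySource.view.drop(1) // only records the operation
    val lazyPulled = lazySource.produced.get

    (strictPulled, lazyPulled)
  }
}
```

With lines of a large file in place of integers, the eager case is the one that tries to hold the whole file in memory.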

You might want to implement Iterable.drop as

def drop(n: Int) = iterator.drop(n).toIterable

but that would break with the representation invariance of the collection library (i.e. iterator.toIterable returns a Stream, while you want List.drop to return a List etc - thus the need for the Builder concept).
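
One possible workaround, which is my own suggestion and not part of the answer above, is to keep the reusable `FileIterable` but apply the transformations through `view`, so that `drop`, `map` and `filter` stay lazy and no intermediate collection is built. The sketch below assumes a hypothetical helper `copySkippingHeader`. Like the question's code, it never closes the `Source`, so real code would want to manage that resource.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

// Same wrapper as in the question: each call to iterator reopens the file.
// Note: the Source is never closed here, as in the original code.
class FileIterable(file: File) extends Iterable[String] {
  override def iterator: Iterator[String] = Source.fromFile(file).getLines()
}

object LazyLines {
  // Hypothetical helper: copies every line except the first from in to out.
  def copySkippingHeader(in: File, out: File): Unit = {
    val writer = new PrintWriter(out)
    try {
      // view keeps drop lazy, so lines are written as they are read
      // and no intermediate collection is built.
      new FileIterable(in).view.drop(1).foreach(line => writer.println(line))
    } finally writer.close()
  }
}
```

A call such as `LazyLines.copySkippingHeader(new File("in.tsv"), new File("out.tsv"))` (file names are placeholders) should then behave like options 1 and 2 in terms of memory, while `FileIterable` itself remains traversable more than once.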
