Scala immutable Map slow


Problem description


I have a piece of code where I create a map like this:

 val map = gtfLineArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap

Then I use this map to create my object:

case class MyObject(val attribute1: String, val attribute2: Map[String, String])

I'm reading millions of lines and converting them to MyObjects using an iterator, like this:

MyObject("1", map)

When I do this it is really slow: more than 1 hour for 2'000'000 entries.

If I remove the map from the object creation but still do the split process (section 1):

val map = gtfLineArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap
MyObject("1", null)

then the script runs in less than 1 minute for the 2'000'000 entries.

I did some profiling, and it looks like the slowdown happens when the object is created: assigning the val map to the object's map is what makes the process slow. What am I missing?

Update to explain the problem better:

My code iterates over 2000000 lines, converting each line to an internal object. To iterate I do:

it.map(createNewObject).toList

This iterator goes through all the lines and converts them to my objects using the function createNewObject.

This is actually really fast, especially with a large heap, as dk14 said. The performance problem is inside my

`createNewObject(line: String)`

This function creates an object:

`class MyObject(val attribute1:String, val attribute2:Map[String, String])` 

My function takes the line and first does

`val attributeArr = line.split("\t")` 

The first element of the array is the attribute1 of my object, and the second attribute (attribute2) is built from the ninth column:

`val map = attributeArr(8).split(";").map(_ split "\"").collect { case Array(k, v) => (k, v) }.toMap` 

If I only print the number of elements in map, the program ends in 2 minutes. If I pass map to my new object, as in MyObject(attribute1, map), the program is really slow.
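A minimal, self-contained sketch of the pipeline described above may make the setup easier to follow. Only the split/collect/toMap chain and the `it.map(createNewObject).toList` call come from the question; the sample line, its nine tab-separated columns, the use of the first column as `attribute1`, and the `trim` on keys are assumptions made for illustration.

```scala
// Hypothetical reconstruction of the question's pipeline; not the asker's actual code.
final case class MyObject(attribute1: String, attribute2: Map[String, String])

// Parses one tab-separated line. Column 9 (index 8) is assumed to hold
// `key "value";` pairs, matching the split chain shown in the question.
def createNewObject(line: String): MyObject = {
  val attributeArr = line.split("\t")
  val map = attributeArr(8)
    .split(";")
    .map(_.split("\""))
    .collect { case Array(k, v) => (k.trim, v) } // trim is an addition, not in the question
    .toMap
  MyObject(attributeArr(0), map)
}

// Made-up sample line with nine tab-separated columns.
val sampleLine: String =
  List("chr1", "src", "gene", "1", "100", ".", "+", ".",
       "gene_id \"G1\"; gene_name \"ABC\";").mkString("\t")

// Same shape as `it.map(createNewObject).toList` in the question.
val objects: List[MyObject] =
  Iterator(sampleLine, sampleLine).map(createNewObject).toList
```

Note that every `MyObject` in the resulting list keeps its own map reachable, so with millions of lines all of those maps stay on the heap at once. When `null` is passed instead, each map becomes garbage as soon as the function returns.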

Solution

(0 to 2000000).toList and (0 to 2000000).map(x => x -> x).toMap have similar performance if you give them enough memory (I tried -Xmx4G, i.e. 4 gigabytes). The toMap implementation does a lot of cloning, so a lot of memory is allocated and deallocated. Under memory starvation, the GC therefore becomes overactive.

When I tried to run (0 to 2000000).toList with 128 MB, it took several seconds, but (0 to 2000000).map(x => x -> x).toMap took at least 2 minutes with 10% GC activity (VisualVM) and died with an out-of-memory error.

However, when I tried -Xmx4G both were pretty fast.
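The comparison above could be reproduced with something like the following sketch. The helper names `timed`, `compare` and `maxHeapMb` are hypothetical, and wall-clock timing of a single run is only a rough indicator, not a proper benchmark.

```scala
// Rough timing helper: returns the result and the elapsed wall-clock milliseconds.
def timed[A](body: => A): (A, Long) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) / 1000000L)
}

// Builds both collections for a given size and reports
// (list size, list ms, map size, map ms).
def compare(n: Int): (Int, Long, Int, Long) = {
  val (list, listMs) = timed((0 to n).toList)
  val (map, mapMs) = timed((0 to n).map(x => x -> x).toMap)
  (list.size, listMs, map.size, mapMs)
}

// Maximum heap the JVM will attempt to use, in megabytes (reflects -Xmx).
def maxHeapMb: Long = Runtime.getRuntime.maxMemory() / (1024L * 1024L)
```

Running `compare(2000000)` once with a small heap and once with `-Xmx4G`, and printing `maxHeapMb` alongside, should show whether the available memory is the deciding factor on a given machine.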


P.S. What toMap does is repeatedly add an element to a prefix tree, so it has to clone (Array.copy) a lot for every element: https://github.com/scala/scala/blob/99a82be91cbb85239f70508f6695c6b21fd3558c/src/library/scala/collection/immutable/HashMap.scala#L321.

So toMap calls updated0 repeatedly (2000000 times), which in turn does an Array.copy quite often. This requires many memory allocations, which in the low-memory case causes the GC to go into MarkAndSweep (slow garbage collection) most of the time (as far as I can see from jconsole).
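To illustrate the idea of building an immutable map one entry at a time, here is a small sketch. It shows only the observable behaviour (each update returns a new map and leaves the old one untouched), not the internals of HashMap, and newer Scala versions may build maps differently from the revision linked above. The name `buildByRepeatedUpdate` is hypothetical.

```scala
// Conceptual equivalent of building an immutable map entry by entry:
// every `+` returns a new map and leaves the previous one unchanged.
def buildByRepeatedUpdate(pairs: Seq[(Int, Int)]): Map[Int, Int] =
  pairs.foldLeft(Map.empty[Int, Int]) { (acc, kv) => acc + kv }

val pairs: Seq[(Int, Int)] = (0 to 1000).map(x => x -> x)
val viaFold: Map[Int, Int] = buildByRepeatedUpdate(pairs)
val viaToMap: Map[Int, Int] = pairs.toMap

// Immutability: updating yields a new map; the original is not modified.
val extended: Map[Int, Int] = viaFold.updated(-1, -1)
```

Both construction paths yield equal maps. Every intermediate map produced by the fold is short-lived garbage, which is the kind of allocation churn the answer attributes to toMap.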


Solution: either increase the memory (-Xmx/-Xms JVM parameters) or, if you need more complex operations on your data set, use something like Apache Spark (or any batch-oriented map-reduce framework) to process your data in a distributed way.
