Scala fast text file read and upload to memory


Problem Description


In Scala, for reading a text file and uploading it into an array, a common approach is

scala.io.Source.fromFile("file.txt").getLines.toArray

Especially for very large files, is there a faster approach, perhaps by reading blocks of bytes into memory first and then splitting them on newline characters? (See Read entire file in Scala for commonly used approaches.)
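For illustration, such a byte-block approach might look roughly like the following (a minimal sketch, assuming the file fits in memory and is UTF-8 encoded; not benchmarked):

import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

// load the whole file into memory as bytes, decode once, then split on newlines
val bytes = Files.readAllBytes(Paths.get("file.txt"))
val lines = new String(bytes, StandardCharsets.UTF_8).split("\n")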

Many Thanks.

Solution

The performance problem has nothing to do with the way the data is read. It is already buffered. Nothing happens until you actually iterate through the lines:

// measures time taken by enclosed code
def timed[A](block: => A) = {
  val t0 = System.currentTimeMillis
  val result = block
  println("took " + (System.currentTimeMillis - t0) + "ms")
  result
}

val source = timed(scala.io.Source.fromFile("test.txt")) // 200mb, 500 lines
// took 0ms

val lines = timed(source.getLines)
// took 0ms

timed(lines.next) // read first line
// took 1ms

// ... reset source ...

var x = 0
timed(lines.foreach(ln => x += ln.length)) // "use" every line
// took 421ms

// ... reset source ...

timed(lines.toArray)
// took 915ms

Considering a read speed of 500 MB per second for my hard drive, the optimum time for the 200 MB would be about 200 / 500 = 0.4 s, i.e. 400 ms, which means there is no room for improvement other than not converting the iterator to an array.

Depending on your application, you could consider using the iterator directly instead of an Array, because working with such a huge array in memory will definitely be a performance issue anyway.
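As a sketch of what working with the iterator directly could look like (a hypothetical example that just sums line lengths, using the same test.txt as above; the 200 MB of lines are never held in memory at once):

// process each line as it is read; no array of lines is ever materialized
val source = scala.io.Source.fromFile("test.txt")
val totalChars =
  try source.getLines.map(_.length.toLong).sum
  finally source.close()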


Edit: From your comments I assume that you want to transform the array further (maybe split the lines into columns, as you said you are reading a numeric array). In that case I recommend doing the transformation while reading. For example:

source.getLines.map(_.split(",").map(_.trim.toInt)).toArray

is considerably faster than

source.getLines.toArray.map(_.split(",").map(_.trim.toInt))

(for me it is 1.9 s instead of 2.5 s), because you don't transform one entire giant array into another but just each line individually, ending up with a single array (which uses only half the heap space). Also, since reading the file is the bottleneck, transforming while reading results in better CPU utilization.
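Putting this together, a self-contained version of the parse-while-reading approach might look like the sketch below (readIntMatrix is a hypothetical helper name; it assumes comma-separated integers per line, as in the snippets above, and closes the Source when done):

import scala.io.Source

// hypothetical helper: parse a file of comma-separated integers into Array[Array[Int]],
// converting each line as it is read instead of collecting the raw lines first
def readIntMatrix(path: String): Array[Array[Int]] = {
  val source = Source.fromFile(path)
  try source.getLines.map(_.split(",").map(_.trim.toInt)).toArray
  finally source.close()
}

val matrix = readIntMatrix("test.txt")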
