Why is "slurping" a file not a good practice?

Question

Why is "slurping" a file not a good practice for normal text-file I/O, and when is it useful?

For example, why shouldn't I use these?

File.read('/path/to/text.txt').lines.each do |line|
  # do something with a line
end

File.readlines('/path/to/text.txt').each do |line|
  # do something with a line
end

Answer

Again and again we see questions asking about reading a text file to process it line-by-line that use variations of read or readlines, which pull the entire file into memory in one action.

The documentation for read says:

Opens the file, optionally seeks to the given offset, then returns length bytes (defaulting to the rest of the file). [...]

The documentation for readlines says:

Reads the entire file specified by name as individual lines, and returns those lines in an array. [...]

Pulling in a small file is no big deal, but there comes a point where memory has to be shuffled around as the incoming data's buffer grows, and that eats CPU time. In addition, if the data consumes too much space, the OS has to get involved just to keep the script running and starts spooling to disk, which will bring a program to its knees. On an HTTPd (web host) or anything needing fast response, it'll cripple the entire application.

Slurping is usually based on a misunderstanding of the speed of file I/O, or on thinking that it's better to read and then split the buffer than to read it a single line at a time.
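The idiomatic alternative is File.foreach, which yields one line at a time and holds only the current line (plus a small I/O buffer) in memory. A minimal sketch, reworking the loops from the question:

# Process the file line by line without ever loading the whole thing.
File.foreach('/path/to/text.txt') do |line|
  # do something with a line
end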

Here's some test code to demonstrate the problem caused by "slurping".

Save this as "test.sh":

echo Building test files...

yes "abcdefghijklmnopqrstuvwxyz 123456890" | head -c 1000       > kb.txt
yes "abcdefghijklmnopqrstuvwxyz 123456890" | head -c 1000000    > mb.txt
yes "abcdefghijklmnopqrstuvwxyz 123456890" | head -c 1000000000 > gb1.txt
cat gb1.txt gb1.txt > gb2.txt
cat gb1.txt gb2.txt > gb3.txt

echo Testing...

ruby -v

echo
for i in kb.txt mb.txt gb1.txt gb2.txt gb3.txt
do
  echo
  echo "Running: time ruby readlines.rb $i"
  time ruby readlines.rb $i
  echo '---------------------------------------'
  echo "Running: time ruby foreach.rb $i"
  time ruby foreach.rb $i
  echo
done

rm [km]b.txt gb[123].txt 

It creates five files of increasing sizes. 1K files are easily processed, and are very common. It used to be that 1MB files were considered big, but they're common now. 1GB is common in my environment, and files beyond 10GB are encountered periodically, so knowing what happens at 1GB and beyond is very important.

Save this as "readlines.rb". It doesn't do anything but read the entire file line-by-line internally, and append it to an array that is then returned, and seems like it'd be fast since it's all written in C:

# Slurp the whole file into an array of lines, then count them via the array's size.
lines = File.readlines(ARGV.shift).size
puts "#{ lines } lines read"

Save this as "foreach.rb":

# Read one line at a time; only the current line is ever held in memory.
lines = 0
File.foreach(ARGV.shift) { |l| lines += 1 }
puts "#{ lines } lines read"

Running sh ./test.sh on my laptop I get:

Building test files...
Testing...
ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-darwin13.0]

Reading the 1K file:

Running: time ruby readlines.rb kb.txt
28 lines read

real    0m0.998s
user    0m0.386s
sys 0m0.594s
---------------------------------------
Running: time ruby foreach.rb kb.txt
28 lines read

real    0m1.019s
user    0m0.395s
sys 0m0.616s

Reading the 1MB file:

Running: time ruby readlines.rb mb.txt
27028 lines read

real    0m1.021s
user    0m0.398s
sys 0m0.611s
---------------------------------------
Running: time ruby foreach.rb mb.txt
27028 lines read

real    0m0.990s
user    0m0.391s
sys 0m0.591s

Reading the 1GB file:

Running: time ruby readlines.rb gb1.txt
27027028 lines read

real    0m19.407s
user    0m17.134s
sys 0m2.262s
---------------------------------------
Running: time ruby foreach.rb gb1.txt
27027028 lines read

real    0m10.378s
user    0m9.472s
sys 0m0.898s

Reading the 2GB file:

Running: time ruby readlines.rb gb2.txt
54054055 lines read

real    0m58.904s
user    0m54.718s
sys 0m4.029s
---------------------------------------
Running: time ruby foreach.rb gb2.txt
54054055 lines read

real    0m19.992s
user    0m18.765s
sys 0m1.194s

Reading the 3GB file:

Running: time ruby readlines.rb gb3.txt
81081082 lines read

real    2m7.260s
user    1m57.410s
sys 0m7.007s
---------------------------------------
Running: time ruby foreach.rb gb3.txt
81081082 lines read

real    0m33.116s
user    0m30.790s
sys 0m2.134s

Notice how readlines runs twice as slow each time the file size increases, and using foreach slows linearly. At 1MB, we can see there's something affecting the "slurping" I/O that doesn't affect reading line-by-line. And, because 1MB files are very common these days, it's easy to see they'll slow the processing of files over the lifetime of a program if we don't think ahead. A couple seconds here or there aren't much when they happen once, but if they happen multiple times a minute it adds up to a serious performance impact by the end of a year.
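If you want to see roughly how much memory a slurped file holds live on your own machine, Ruby's objspace extension reports approximate per-object sizes. A rough sketch (memsize_of figures are implementation-dependent):

require 'objspace'

# Approximate the bytes held live by the slurped array and its strings.
lines = File.readlines(ARGV.shift)
bytes = ObjectSpace.memsize_of(lines) +
        lines.inject(0) { |sum, line| sum + ObjectSpace.memsize_of(line) }
puts "#{ lines.size } lines held in memory, roughly #{ bytes } bytes"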

I ran into this problem years ago when processing large data files. The Perl code I was using would periodically stop as it reallocated memory while loading the file. Rewriting the code to not slurp the data file, and instead read and process it line by line, improved the run time from more than five minutes to less than one, and taught me a big lesson.

"slurping" 文件有时很有用,特别是如果您必须跨行边界执行某些操作,但是,如果必须这样做,则值得花一些时间考虑读取文件的替代方法.例如,考虑维护一个由最后n"行构建的小缓冲区并对其进行扫描.这将避免因尝试读取和保存整个文件而导致的内存管理问题.这在与 Perl 相关的博客Perl Slurp-Eaze" 涵盖了何时"和为什么"以证明使用完整文件读取是合理的,并且非常适用于 Ruby.

"slurping" a file is sometimes useful, especially if you have to do something across line boundaries, however, it's worth spending some time thinking about alternate ways of reading a file if you have to do that. For instance, consider maintaining a small buffer built from the last "n" lines and scan it. That will avoid memory management issues caused by trying to read and hold the entire file. This is discussed in a Perl-related blog "Perl Slurp-Eaze" which covers the "whens" and "whys" to justify using full file-reads, and applies well to Ruby.

For other excellent reasons not to "slurp" your files, read "How to search file text for a pattern and replace it with a given value".
