How can I set the file-read buffer size in Perl to optimize it for large files?


Problem Description

I understand that both Java and Perl try quite hard to find a one-size-fits-all default buffer size when reading in files, but I find their choices to be increasingly antiquated, and I am having a problem changing the default choice when it comes to Perl.

In the case of Perl, which I believe uses 8K buffers by default, similar to Java's choice, I can't find a reference using the perldoc website search engine (really Google) on how to increase the default file input buffer size to, say, 64K.

From the above link, to show how 8K buffers don't scale:

If lines typically have about 60 characters each, then the 10,000-line file has about 610,000 characters in it. Reading the file line-by-line with buffering only requires 75 system calls and 75 waits for the disk, instead of 10,001.

So for a 50,000,000-line file with 60 characters per line (including the newline at the end), an 8K buffer means it's going to make about 366,211 system calls to read the 2.8GiB file. As an aside, you can confirm this behaviour by watching the disk I/O read delta in the task manager process list (in Windows at least; top in *nix shows the same thing too, I'm sure) as your Perl program takes 10 minutes to read in a text file :)
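As a sanity check on those figures, here is a tiny standalone Perl sketch (my own, not part of the original question) that redoes the arithmetic with the constants quoted above:

#!/usr/bin/perl
# Back-of-the-envelope check of the figures above.
use strict;
use warnings;

my $lines     = 50_000_000;
my $line_len  = 60;                      # characters per line, trailing newline included
my $file_size = $lines * $line_len;      # 3,000,000,000 bytes
my $buf_size  = 8 * 1024;                # the 8K default buffer assumed above

my $syscalls = int( $file_size / $buf_size );
$syscalls++ if $file_size % $buf_size;   # one final partial read

printf "%.2f GiB, about %d read() system calls\n",
    $file_size / 2**30, $syscalls;       # prints: 2.79 GiB, about 366211 read() system calls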

Someone asked the question about increasing the Perl input buffer size on perlmonks, and someone replied that you could increase the size of "$/" and thus increase the buffer size; however, from the perldoc:

Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer.

So I assume that this does not actually increase the buffer size that Perl uses to read ahead from the disk when using the typical:

while(<>) {
    #do something with $_ here
    ...
}

"line-by-line" idiom.

Now it could be that a different "read a record at a time and then parse it into lines" version of the above code would be faster in general, bypassing the underlying problem of the standard idiom not letting you change the default buffer size (if that's indeed impossible): you could set the "record size" to anything you wanted, parse each record into individual lines, and hope that Perl does the right thing and ends up doing one system call per record. But that adds complexity, and all I really want is an easy performance gain from increasing the buffer used in the above example to a reasonably large size, say 64K, or even tuning that buffer size to the optimal size for long reads with a test script on my system, without the extra hassle.
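For reference, here is a rough sketch of that record-at-a-time workaround (my own illustration, not code from the question; it assumes newline-terminated lines and uses a placeholder file name):

use strict;
use warnings;

open my $fh, '<', 'big.txt' or die "open: $!";   # 'big.txt' is a placeholder name

local $/ = \65536;               # <$fh> now returns up to 64K per read instead of one line

my $tail = '';                   # partial line carried over from the previous record
while ( defined( my $record = <$fh> ) ) {
    $record = $tail . $record;
    $record =~ s/([^\n]*)\z//;   # strip any trailing partial line ...
    $tail   = $1;                # ... and keep it for the next record
    for my $line ( split /\n/, $record ) {
        # do something with $line here
    }
}
# A final line with no trailing newline ends up in $tail.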

Things are much better in Java as far as straight-forward support for increasing the buffer size goes.

In Java, I believe the current default buffer size that java.io.BufferedReader uses is also 8192 bytes, although up-to-date references in the JDK docs are equivocal, e.g., the 1.5 docs say only:

The buffer size may be specified, or the default size may be accepted. The default is large enough for most purposes.

Luckily with Java you do not have to trust the JDK developers to have made the right decision for your application and can set your own buffer size (64K in this example):

import java.io.BufferedReader;
[...]
reader = new BufferedReader(new InputStreamReader(fileInputStream, "UTF-8"), 65536);
[...]
while (true) {
    String line = reader.readLine();
    if (line == null) {
        break;
    }
    /* do something with the line here */
    foo(line);
}

There's only so much performance you can squeeze out of parsing one line at a time, even with a huge buffer and modern hardware. I'm sure there are ways to get every ounce of performance out of reading a file by reading big many-line records, breaking each into tokens, and then doing stuff with those tokens once per record, but they add complexity and edge cases (although if there's an elegant solution in pure Java, using only the features present in JDK 1.5, that would be cool to know about). Increasing the buffer size in Perl would solve 80% of the performance problem for Perl at least, while keeping things straightforward.

My question is:

Is there a way to adjust that buffer size in Perl for the above typical "line-by-line" idiom, similar to how the buffer size was increased in the Java example?

Solution

You can affect the buffering, assuming that you're running on an O/S that supports setvbuf. See the documentation for IO::Handle. You don't have to explicitly create an IO::Handle object as in the documentation if you're using perl 5.10; all handles are implicitly IO::Handles since that release.

use 5.010;
use strict;
use warnings;

use autodie;

use IO::Handle '_IOLBF';

open my $handle, '<:utf8', 'foo';

my $buffer;
$handle->setvbuf($buffer, _IOLBF, 0x10000);   # 0x10000 bytes = a 64K buffer

while ( my $line = <$handle> ) {
    ...
}
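Since the question also mentions tuning the buffer size with a test script, a minimal timing harness along those lines might look like the following. This is my own sketch, not part of the answer: it assumes setvbuf is available as described above and takes the file to read as a command-line argument.

#!/usr/bin/perl
# Crude timing loop: read the same file with several buffer sizes.
use 5.010;
use strict;
use warnings;

use autodie;
use Time::HiRes 'time';
use IO::Handle '_IOLBF';

my $file = shift or die "usage: $0 FILE\n";

for my $size ( 8 * 1024, 64 * 1024, 256 * 1024 ) {
    open my $handle, '<:utf8', $file;

    my $buffer;
    $handle->setvbuf( $buffer, _IOLBF, $size );

    my $start = time;
    my $count = 0;
    $count++ while <$handle>;

    printf "%7d-byte buffer: %d lines in %.2f s\n", $size, $count, time - $start;
    close $handle;
}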
