Best way to skip a header when reading in from a text file in Perl?

Question

I'm grabbing a few columns from a tab-delimited file in Perl. The first line of the file is completely different from the other lines, so I'd like to skip that line as fast and efficiently as possible.

This is what I have so far.

my $firstLine = 1;

while (<INFILE>){
    if($firstLine){
        $firstLine = 0;
    }
    else{
        my @columns = split (/\t+/);
        print OUTFILE "$columns[0]\t\t$columns[1]\t$columns[2]\t$columns[3]\t$columns[11]\t$columns[12]\t$columns[15]\t$columns[20]\t$columns[21]\n";
    }
}

Is there a better way to do this, perhaps without $firstLine? OR is there a way to start reading INFILE from line 2 directly?

Thanks in advance!

Recommended answer

Let's get some data on this. I benchmarked everybody's techniques...

#!/usr/bin/env perl

sub flag_in_loop {
    my $file = shift;

    open my $fh, '<', $file or die "Cannot open $file: $!";

    my $first = 1;
    while(<$fh>) {
        if( $first ) {
            $first = 0;
        }
        else {
            my $line = $_;
        }
    }

    return;
}

sub strip_before_loop {
    my $file = shift;

    open my $fh, '<', $file or die "Cannot open $file: $!";

    my $header = <$fh>;
    while(<$fh>) {
        my $line = $_;
    }

    return;
}

sub line_number_in_loop {
    my $file = shift;

    open my $fh, '<', $file or die "Cannot open $file: $!";

    while(<$fh>) {
        next if $. < 2;

        my $line = $_;
    }

    return;
}

sub inc_in_loop {
    my $file = shift;

    open my $fh, '<', $file or die "Cannot open $file: $!";

    my $first;
    while(<$fh>) {
        $first++ or next;

        my $line = $_;
    }

    return;
}

sub slurp_to_array {
    my $file = shift;

    open my $fh, '<', $file or die "Cannot open $file: $!";

    my @array = <$fh>;
    shift @array;

    return;
}


my $Test_File = "/usr/share/dict/words";
print `wc $Test_File`;

use Benchmark;

timethese shift || -10, {
    flag_in_loop        => sub { flag_in_loop($Test_File); },
    strip_before_loop   => sub { strip_before_loop($Test_File); },
    line_number_in_loop => sub { line_number_in_loop($Test_File); },
    inc_in_loop         => sub { inc_in_loop($Test_File); },
    slurp_to_array      => sub { slurp_to_array($Test_File); },
};

Since this is I/O, which can be affected by forces beyond the ability of Benchmark.pm to adjust for, I ran them several times and checked that I got the same results.

/usr/share/dict/words is a 2.4 meg file with about 240k very short lines. Since we're not processing the lines, line length shouldn't matter.

I only did a tiny amount of work in each routine to emphasize the difference between the techniques. I wanted to do some work so as to produce a realistic upper limit on how much performance you're going to gain or lose by changing how you read files.

I did this on a laptop with an SSD, but it's still a laptop. As I/O speed increases, CPU time becomes more significant. Technique is even more important on a machine with fast I/O.

Here's how many times each routine read the file per second.

slurp_to_array:       4.5/s
line_number_in_loop: 13.0/s
inc_in_loop:         15.5/s
flag_in_loop:        15.8/s
strip_before_loop:   19.9/s

I'm shocked to find that my @array = <$fh> is slowest by a huge margin. I would have thought it would be the fastest given all the work is happening inside the perl interpreter. However, it's the only one which allocates memory to hold all the lines and that probably accounts for the performance lag.
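If the rest of a program really does need every line in memory, the header can still be dropped without shift by reading it in scalar context first. This is only a sketch of that variation: it is not one of the benchmarked routines, so no timing is claimed for it, and the read_rows name and the in-memory sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns all lines after the header.
sub read_rows {
    my ($fh) = @_;

    my $header = <$fh>;    # scalar context: reads exactly one line
    my @rows   = <$fh>;    # list context: reads all remaining lines
    chomp @rows;

    return @rows;
}

# In-memory filehandle so the sketch runs without a real file.
my $text = "id\tname\n1\tfoo\n2\tbar\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @rows = read_rows($fh);
close $fh;

printf "%d data rows\n", scalar @rows;
```

This still allocates memory for every data line, so the cost discussed above applies to it as well.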

Using $. is another surprise. Perhaps that's the cost of accessing a magic global, or perhaps it's doing a numeric comparison.
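For reference, $. holds the current line number of the most recently read filehandle, which is what makes the line-number test work. A minimal sketch of that behaviour, using an in-memory filehandle and a made-up data_lines helper (both are illustrative assumptions, not part of the benchmark):

```perl
use strict;
use warnings;

# Hypothetical helper: returns every line except the first,
# using $. (the current input line number) to detect the header.
sub data_lines {
    my ($fh) = @_;
    my @lines;

    while (my $line = <$fh>) {
        next if $. == 1;    # $. is 1 right after the first line is read
        chomp $line;
        push @lines, $line;
    }

    return @lines;
}

# In-memory filehandle so the sketch runs without a real file.
my $text = "name\tvalue\nfoo\t1\nbar\t2\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @rows = data_lines($fh);
close $fh;

print "$_\n" for @rows;
```

The test still runs once per line, which is the per-iteration overhead the benchmark is measuring.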

And, as predicted by algorithmic analysis, putting the header check code outside the loop is the fastest. But not by much. Probably not enough to worry about if you're using the next two fastest.
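Applied to the code in the question, reading the header before the loop might look like the sketch below. The column indexes come from the question's print statement; the copy_columns name, the lexical filehandles, the in-memory sample data, and the single-tab output separator are assumptions made for illustration (the question's output uses a double tab after the first column).

```perl
use strict;
use warnings;

# Column indexes taken from the question's print statement.
my @WANTED = (0, 1, 2, 3, 11, 12, 15, 20, 21);

# Hypothetical helper: discards the header line, then writes the
# selected columns of every remaining line to $out.
# Returns the number of data lines written.
sub copy_columns {
    my ($in, $out) = @_;

    my $header = <$in>;                 # read and discard the first line
    return 0 unless defined $header;    # empty input: nothing to do

    my $count = 0;
    while (my $line = <$in>) {
        chomp $line;
        my @columns = split /\t+/, $line;
        # Missing columns become empty strings instead of warnings.
        my @picked = map { defined $_ ? $_ : '' } @columns[@WANTED];
        print {$out} join("\t", @picked), "\n";
        $count++;
    }

    return $count;
}

# In-memory filehandles so the sketch runs without real files.
my $input = "some completely different header\n"
          . join("\t", map { "c$_" } 0 .. 21) . "\n";
my $output = '';
open my $in,  '<', \$input  or die "Cannot open input: $!";
open my $out, '>', \$output or die "Cannot open output: $!";
my $rows = copy_columns($in, $out);
close $in;
close $out;

print $output;
```

With real files, the two open calls would take file names instead of scalar references; the loop body stays the same and no flag variable is needed.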