Perl: read a large file for use with a multi-line regex

Question

I have a 4GB text file with lines of highly variable length. This is only a sample file; production files will be much larger. I need to read the file and apply a multi-line regex.

What is the best way to read such a large file for use with a multi-line regex?

If I read it line by line, I don't think my multi-line regex will work correctly. When I use the read function in its 3-argument form, my regex results vary as I change the length I specify in the read statement. I believe the file is too large to be read into an array or into memory.

Here is my code:

package main;
use strict;
use warnings;

our $VERSION = 1.01;
my $buffer;
my $INFILE;
my $OUTFILE;

open $INFILE, '<', ... or die "Bad Input File: $!";
open $OUTFILE, '>',... or die "Bad Output File: $!";

while ( read $INFILE, $buffer, 512  ) {
    if ($buffer =~ /(?m)(^[^\r\n]*\R+){1}^(B|BREAK|C|CLOSE|D|DO(?! NOT)|E|ELSE|F|FOR|G|GOTO|H|HALT|HANG|I|IF|J|JOB|K|KILL|L|LOCK|M|MERGE|N|O|OPEN|Q|QUIT|R|READ|S|SET|TC|TRE|TRO|TS|U|USE|V|VIEW|W|WRITE|X|XECUTE)( |:).*[^\r\n]/) {
        print $OUTFILE $&;
        print $OUTFILE "\n";
    }
}

close( $INFILE ); 
close( $OUTFILE );
1;

Here is some sample data:

^%Z("EUD")
S %L=%LO,%N="E1"
^%Z("RT")
This is data that I don't want the regex to find
^%Z("EXY")
X ^%Z("EW2"),^%Z("ELONG"):$L(%L)>245 S %N="E1" Q:$L(%L)>255  X ^%ZOSF("EON") S DX=0,DY=%EY,X=%RM+1 X ^%ZOSF("RM"),XY K %EX,%EY,%E1,%E2,DX,DY,%N Q
^%Z("F12")
S %A=$P(^DIC(9.8,0),"^",3)+1,%C=$P(^(0),"^",4)+1 X "F %=0:0 Q:'$D(^DIC(9.8,%A,0))  S %A=%A+1" S $P(^DIC(9.8,0),"^",3,4)=%A_"^"_%C,^DIC(9.8,%A,0)=%X_"^R",^DIC(9.8,"B",%X,%A)=""
^%Z("F2")
S %=$H>21549+$H-.1,%Y=%\365.25+141,%=%#365.25\1,%D=%+306#(%Y#4=0+365)#153#61#31+1,%M=%-%D\29+1,%DT=%Y_"00"+%M_"00"+%D,%D=%M_"/"_%D_"/"_$E(%Y,2,3)

The lines above are syntactically paired (lines 1 and 2 go together, 3 and 4, and so on). I need to find specific pairs; in the data above, that is every pair except:

^%Z("RT")
This is data that I don't want the regex to find

Recommended answer

The question is apparently about parsing a DSL, and in general regex does not seem to be the right tool for that. A quick search did not yield an easy list of accepted approaches, only pages of CPAN modules and posts like this article. Finding out the best approach is indeed the first step.

However, below is an answer to the question as stated in the title and in the clear description: how to parse a very large file where the units to be processed are spread over an unknown number of lines.
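As background, here is a minimal sketch of why the results in the question change with the length passed to read. It matches each fixed-size chunk on its own with a single `if`, as the question's loop does. The helper name count_chunk_matches, the tiny data string, and the shortened pattern are illustrative assumptions, not taken from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: read $data in chunks of $size bytes and count how many
# chunks match $re, testing each chunk in isolation with a single "if".
sub count_chunk_matches {
    my ($data, $size, $re) = @_;
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    my $count = 0;
    my $buffer;
    while ( read $fh, $buffer, $size ) {
        $count++ if $buffer =~ $re;
    }
    close $fh;
    return $count;
}

# Two header/command pairs of 9 characters each (assumed toy data).
my $data = "^A\nS X=1\n^B\nS Y=2\n";

# Shortened stand-in for the question's pattern: a header line starting
# with "^", then a line starting with S or X followed by a space or colon.
my $re = qr/^\^[^\r\n]*\R(?:S|X)[ :]/m;

for my $size (4, 9, 18) {
    printf "chunk size %2d: %d match(es)\n",
        $size, count_chunk_matches($data, $size, $re);
}
```

With this data the counts are 0, 2 and 1. A chunk size of 4 splits every pair across chunk boundaries, 9 happens to align with the pairs, and 18 puts both pairs into one chunk where the single `if` reports only the first.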

Keep assembling a 'buffer' and checking it. Once you find a match, process it and clear the buffer.

For instance, append a line to a variable and check it (try to match, if you use a regex). Keep going, and once it does match, process the variable and clear it.

my $unit;
while (<$fh>) {
    # chomp;            # if suitable, and then add a space
    # $unit .= ' '.$_;  # as a separator that newline was
    $unit .= $_;

    if ( test_unit($unit) ) {
         # process ...
         $unit = undef;
    }
}

The test_unit() sub is a placeholder for code that decides whether the assembled unit should be processed. If that is a regex, it can be defined before the loop with my $re = qr/.../; (see qr in perlop) and then tested in the loop with if ($unit =~ $re).
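The loop above, with a precompiled qr// pattern in place of test_unit(), might look like the following sketch. The helper name extract_units, the shortened command list in the pattern, and the in-memory sample are assumptions made for illustration; a real run would open the actual file and use the full pattern from the question.

```perl
use strict;
use warnings;

# Simplified stand-in for the question's pattern (assumption): a header line
# starting with "^", followed by a line that starts with one of a few
# command words and then a space or colon.
my $re = qr/^(\^[^\r\n]*)\R((?:S|X|SET|XECUTE)[ :][^\r\n]*)/m;

# Hypothetical helper: read $fh line by line, append each line to a buffer,
# and collect a unit every time the buffer matches $re.
sub extract_units {
    my ($fh, $re) = @_;
    my @units;
    my $unit = '';
    while ( my $line = <$fh> ) {
        $unit .= $line;
        if ( $unit =~ $re ) {
            push @units, "$1\n$2";    # process: here, just keep the pair
            $unit = '';               # clear the buffer after processing
        }
    }
    return @units;
}

# A few lines modeled on the sample data in the question.
my $sample = <<'END';
^%Z("EUD")
S %L=%LO,%N="E1"
^%Z("RT")
This is data that I don't want the regex to find
^%Z("EXY")
X ^%Z("EW2")
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @units = extract_units($fh, $re);
close $fh;

print "$_\n---\n" for @units;
```

Only one line is read at a time, so memory use is bounded by the distance between matches rather than by the file size. Lines that never match stay in the buffer until the next match clears it, which is fine for occasional unwanted records but would need a size cap if long stretches of the file can go unmatched.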

A note in the question states that the lines to be processed come in pairs, but a comment clarifies that subsequent lines don't always pair up. Thus we can't simply process pairs of lines.
