Matching multiple lines of poorly formatted text in Perl


Question

I have data coming from an external program in the format shown below, and I need to get the first four fields (text, username, number and timestamp) of each record. Note that "Hello Line1" is a single field and the second field is the username. A record can be on a single line (like Line1 below), spread over three lines (like Line2) or spread over two lines (like Line4). The formats can also be mixed, as in the sample below: the output is not always single-line or always double-line.

Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line2

                         Line2FirstName-LastName       8       7/17/2015 1:15 PM 

Line2Testing - 12323232323 Hello There

Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM

Hello Line4

                         Line4FirstName-LastName       8       9/17/2015 1:20 PM

(Screenshot of the above in an editor.)

I was able to get a multiline regex with the help of this question: "Perl multiline regex for first 3 individual items".

Thanks to @GsusRecovery!

Since I am reading the output line by line, I don't think I can take advantage of the multiline regex when only a single line has been read. Is it possible to read just one line when the record is on one line, two lines when it is spread over two, and three lines when it is spread over three?

Or is it better to read every line and backtrack depending on whether the record uses the two-line or the three-line format?

Please advise.

Recommended answer

UPDATE: I've changed the script to read stdin into the array @output_lines (to emulate @sureng's input situation).

I've wrapped the regex in a line accumulator that recognizes the time of day (for example "2:46 PM") as the closing pattern of a record. This way you can read the output line by line and still apply the multiline regex.

#!/usr/bin/perl

use strict;
use warnings;

my $accumulator = '';  # lines collected for the record currently being read
my ($chat,$username,$chars,$timestamp);

my @output_lines = <STDIN>;

foreach (@output_lines)
{
    $accumulator .= $_;

    # Try to extract the four fields from everything accumulated so far.
    ($chat,$username,$chars,$timestamp) = $accumulator =~ m/(?im)^\s*(.+)\s+(\w+[-,\.]\w+)\s+(\d+)\s+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)\s*$/;
    $chat =~ s/\s+$// if $chat;  # remove trailing spaces

    # A time of day such as "2:46 PM" marks the end of a record.
    if ( $accumulator =~ /(?i)([0-2]?\d:[0-5]?\d\s?[ap]m)/ ) {
        print "SECTION matched\n";
        print "-"x80,"\n";
        print "$accumulator";
        print "-"x80,"\n";
        print "chat -> ${chat}\n";
        print "username -> ${username}\n";
        print "chars -> ${chars}\n";
        print "timestamp -> ${timestamp}\n\n";
        $accumulator = '';  # reset the line accumulator
    }
}
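The long pattern in the script is easier to follow when it is laid out with the /x modifier. The sketch below is the same pattern spread over several lines with comments. The names $record_re and parse_record are hypothetical and are not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The record pattern from the script above, laid out with /x for readability.
my $record_re = qr{
    ^ \s*
    (.+)                    # 1: chat text, greedy but backtracks to leave room for the rest
    \s+
    ( \w+ [-,.] \w+ )       # 2: username, two words joined by a dash, comma or dot
    \s+
    ( \d+ )                 # 3: number
    \s+
    (                       # 4: timestamp
        [0-1]?\d / [0-3]?\d / [1-2]\d{3}    # date, month first
        \s+
        [0-2]?\d : [0-5]?\d \s? [ap]m       # 12-hour time
    )
    \s* $
}xim;

# Hypothetical helper: returns the four fields as a key/value list,
# or an empty list when the text is not yet a complete record.
sub parse_record {
    my ($text) = @_;
    my ($chat, $username, $chars, $timestamp) = $text =~ $record_re
        or return;
    $chat =~ s/\s+$//;      # remove trailing spaces
    return (
        chat      => $chat,
        username  => $username,
        chars     => $chars,
        timestamp => $timestamp,
    );
}

# A three-line record, as it would sit in the accumulator.
my %fields = parse_record(
    "Hello Line2\n\n     Line2FirstName-LastName       8       7/17/2015 1:15 PM \n"
);
print "$_ -> $fields{$_}\n" for qw(chat username chars timestamp);
```

Because "." does not match a newline here (there is no /s modifier), the chat text always stays on one line, while the \s+ separators are what let the username and the remaining fields sit on a later line.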

Try the solution online (with your example provided as stdin) here.

In your shell, given the script above and this input file:

# MultiLineInput.txt
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line2

                     Line2FirstName-LastName       8       7/17/2015 1:15 PM 
Line2Testing - 12323232323 Hello There

Hello Line3 Line3FirstName.LastName 8 3/21/2011 2:46 PM

Hello Line4

                     Line4FirstName-LastName       8       9/17/2015 1:20 PM

you can simply run:

cat MultiLineInput.txt | perl StreamRegex.pl

If it works as expected, you can replace the cat command with your actual source.

NB: this approach is needed if you process a stream, or if your file is bigger than the system's volatile memory (so that you want to process it as a stream); that said, it works in any case.
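The script above reads all of stdin into @output_lines before the loop starts, so the whole input is in memory at once. For a true stream, the same accumulator can be driven by a while loop that reads one line at a time. This is a minimal sketch under that assumption. The helper name each_record and its callback interface are hypothetical and are not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reads a filehandle one line at a time and calls
# $callback->($chat, $username, $chars, $timestamp) for each complete record.
sub each_record {
    my ($fh, $callback) = @_;
    my $accumulator = '';
    while (my $line = <$fh>) {
        $accumulator .= $line;

        # Keep accumulating until a time of day closes the record.
        next unless $accumulator =~ /[0-2]?\d:[0-5]?\d\s?[ap]m/i;

        # Same record pattern as in the script above.
        if (my @fields = $accumulator =~ m/(?im)^\s*(.+)\s+(\w+[-,\.]\w+)\s+(\d+)\s+([0-1]?\d\/[0-3]?\d\/[1-2]\d{3}\s+[0-2]?\d:[0-5]?\d\s?[ap]m)\s*$/) {
            $fields[0] =~ s/\s+$//;   # remove trailing spaces from the chat text
            $callback->(@fields);
        }
        $accumulator = '';            # reset for the next record
    }
    return;
}

# Demonstration on an in-memory handle holding a one-line and a two-line record.
my $sample = <<'END';
Hello Line1 FirstName.LastName 10 3/23/2011 2:46 PM

Hello Line4

                     Line4FirstName-LastName       8       9/17/2015 1:20 PM
END
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
each_record($fh, sub { print join(' | ', @_), "\n" });
close $fh;
```

To read from a pipe instead, pass \*STDIN as the filehandle. Only the lines of the current record are held in memory, and a closing line that fails the full pattern is dropped rather than printed with undefined fields.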
