Perl multiline string regex

Question

I am trying to find all the strings (delimited by " or ') in a file by reading the file line by line.

my @strings = ();
open FILE, $file or die "File operation failed: $!";
foreach my $line (<FILE>) {
    push(@strings, $1) if $line =~ /(['"].*['"])/;
}
close FILE;

The problem is that this code works only for strings on a single line.

print "single line string";   

But I also have to match multiline strings like:

print "This is a
multiline
string";

How can I do this?

By the way, I know my regex isn't good enough: it should match strings that start with " and end with " (and likewise for single quotes), but not mismatched ones such as "not correct string'

Update: my new code is:

use Text::ParseWords;   # provides quotewords()

my @strings = ();
open FILE, $file or die "File operation failed: $!";
local $/;               # slurp mode: read the whole file at once
foreach my $line (<FILE>) {
    push(@strings, grep { defined and /["']/ } quotewords('\s+', 1, $line));
}
close FILE;

But if the data is:

print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";

I should get:

"single line \n"
"This is a
multiline
string"
'single quote string'
"string with variable "
" after variable"

Recommended answer

The following are two regexes for matching single-quoted and double-quoted strings. Note that I've slurped all the data in order to be able to catch multiline strings:

use strict;
use warnings;

my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

my $data = do {local $/; <DATA>};

while ($data =~ /($squo_re|$dquo_re)/g) {
    print "<$1>\n";
}

__DATA__
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";

However, because you're trying to parse Perl code, the cleanest way to do it is to use PPI:

use strict;
use warnings;

use PPI;

my $src = do {local $/; <DATA>};

# Load a document
my $doc = PPI::Document->new( \$src );

# Find all quoted-string tokens within the doc
# (find returns a false value when nothing matches, so fall back to an empty list)
my $strings = $doc->find( 'PPI::Token::Quote' ) || [];
for (@$strings) {
    print '<', $_->content, ">\n";
}

__DATA__
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";

Both approaches output:

<"single line \n">
<"This is a
multiline
string">
<'single quote string'>
<"string with variable ">
<" after variable">

Update on (?> ... )

The following is an annotated version of the double quote regular expression.

my $dquo_re = qr{
    "
        (?:                # Non-capturing group - http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
            (?>            # Independent Subexpression to prevent backtracking (this is for efficiency only) - http://perldoc.perl.org/perlretut.html#Using-independent-subexpressions-to-prevent-backtracking
                [^"\\]*    # All characters NOT a " or \
            )
        |
            \\.            # Backslash followed by any escaped character
        )*                 # Any number of the preceding or'd group
    "
    }x;
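
As a quick check of the `\\.` branch, the sketch below runs the same double-quote pattern against a string that contains escaped quotes. The sample text and the `@found` array are made up for illustration and are not part of the original answer.

```perl
use strict;
use warnings;

# Same double-quote pattern as in the answer
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

# Hypothetical sample: the first string contains escaped double quotes
my $sample = <<'END';
print "she said \"hi\" to me";
print "plain";
END

# Collect every double-quoted string found in the sample
my @found;
while ($sample =~ /($dquo_re)/g) {
    push @found, $1;
}

print "<$_>\n" for @found;
```

The escaped quotes should stay inside the first match rather than ending it early, so two strings are expected here, not four.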

The independent subexpression (?> ... ) is not actually required for this regex to work. It is intended to prevent backtracking, because there is only one way for a quoted string to match: either we find an ending quote using the above rules or we don't.

The subexpression is a lot more useful when dealing with a recursive regex, but I've always used it in this case. I'll have to benchmark it later to decide whether it's actually just a premature optimization.
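
One possible way to run such a benchmark is sketched below, using the core Benchmark module to compare the pattern with and without the independent subexpression. The test input, the `count_matches` helper, and the variable names are assumptions made for illustration; results will depend on the Perl version and on the input, so no timings are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The answer's pattern, and the same pattern without (?> ... )
my $atomic_re = qr{"(?:(?>[^"\\]*)|\\.)*"};
my $plain_re  = qr{"(?:[^"\\]*|\\.)*"};

# Hypothetical input: 200 short quoted strings plus one unterminated quote,
# the case where a failed match forces the engine to backtrack
my $text = join('', map { qq{print "line $_ \\n";\n} } 1 .. 200) . q{"unterminated};

# Count matches so both candidates do the same work
sub count_matches {
    my ($re) = @_;
    my $n = 0;
    $n++ while $text =~ /$re/g;
    return $n;
}

# Run each candidate for at least one CPU second and print a comparison table
cmpthese(-1, {
    atomic => sub { count_matches($atomic_re) },
    plain  => sub { count_matches($plain_re) },
});
```

Both patterns should find the same matches; only the time spent on the failing, unterminated quote is expected to differ, if at all.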

Update about comments

To avoid matching quotes inside comments, you can just use the PPI solution that I already proposed. It's meant to parse Perl code and will work as it is.

However, given this is a lab assignment, a regex solution would be to set up a second capturing group in your loop for finding comments:

while ($data =~ /($squo_re|$dquo_re)|($comment_re)/g) {
    my $quote   = $1;
    my $comment = $2;

    if (defined $quote) {
        print "<$quote>\n";
    } elsif (defined $comment) {
        print "Comment - $comment\n";
    }
}

The above will match either a quoted string or a comment. Only the capture group that actually matched will be defined, so you can tell which one was found. You will have to come up with the regular expression for finding a comment on your own, though.
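
For readers who want to see the loop run end to end, here is a minimal sketch with a deliberately naive `$comment_re`. That pattern is an assumption, not part of the original answer: it treats `#` through the end of the line as a comment, so it would misfire on constructs such as `$#array`, `#` used as a regex delimiter, POD, and heredocs. The sample data and the `@quotes` and `@comments` arrays are likewise made up for illustration.

```perl
use strict;
use warnings;

my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

# Assumption: a comment is a '#' through the end of the line (naive on purpose)
my $comment_re = qr{#[^\n]*};

# Hypothetical sample data
my $data = <<'END';
print "real string"; # a comment with "quotes" inside
# print 'commented out';
print 'it has a # inside';
END

my (@quotes, @comments);
while ($data =~ /($squo_re|$dquo_re)|($comment_re)/g) {
    my ($quote, $comment) = ($1, $2);

    if (defined $quote) {
        push @quotes, $quote;        # a quoted string outside any comment
    } elsif (defined $comment) {
        push @comments, $comment;    # a comment; quotes inside it are consumed
    }
}

print "<$_>\n" for @quotes;
print "Comment - $_\n" for @comments;
```

Because the regex consumes whichever construct starts first, a `#` inside a quoted string stays part of the string, and quotes inside a comment stay part of the comment.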
