Perl multiline string regex
Question
I am trying to find all the strings (delimited by " or ') in a file by reading the file line by line.
my @strings = ();
open FILE, $file or die "File operation failed: $!";
foreach my $line (<FILE>) {
push(@strings, $1) if /(['"].*['"])/g;
}
close FILE;
The problem is that this code only works for strings on a single line.
print "single line string";
But I also have to match multiline strings like:
print "This is a
multiline
string";
What should I do?
By the way, I know my regex isn't good enough: it should match strings that start with " and end with " (and the same for single quotes), but not something like "not correct string'
Update: my new code is
my @strings = ();
open FILE, $file or die "File operation failed: $!";
local $/;
foreach my $line (<FILE>) {
push(@strings, grep { defined and /["']/ } quotewords('\s+', 1, $_));
}
close FILE;
But if the data is:
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";
I should get:
"single line \n"
"This is a
multiline
string"
'single quote string'
"string with variable "
" after variable"
Recommended answer
The following are two regexes for parsing either single- or double-quoted strings. Note that I've slurped all the data in order to be able to catch multiline strings:
use strict;
use warnings;
my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};
my $data = do {local $/; <DATA>};
while ($data =~ /($squo_re|$dquo_re)/g) {
print "<$1>\n";
}
__DATA__
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";
However, because you're trying to parse Perl code, the cleanest way of doing it is to use PPI:
use strict;
use warnings;
use PPI;
my $src = do {local $/; <DATA>};
# Load a document
my $doc = PPI::Document->new( \$src );
# Find all the quoted-string tokens within the doc (find returns false if there are none)
my $strings = $doc->find( 'PPI::Token::Quote' ) || [];
for (@$strings) {
print '<', $_->content, ">\n";
}
__DATA__
print $time . "single line \n";
print "This is a
multiline
string";
print 'single quote string';
print "string with variable ".$time." after variable";
Both methods output:
<"single line \n">
<"This is a
multiline
string">
<'single quote string'>
<"string with variable ">
<" after variable">
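To collect the matches into an array from a real file, as the question asks, the regex approach could be wrapped in a small subroutine. This is only a minimal sketch: the helper names `extract_strings` and `slurp_file` and the file name `script.pl` are hypothetical and not part of the original answer.

```perl
use strict;
use warnings;

my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

# Hypothetical helper: return every quoted string found in $text.
sub extract_strings {
    my ($text) = @_;
    my @found;
    while ($text =~ /($squo_re|$dquo_re)/g) {
        push @found, $1;
    }
    return @found;
}

# Hypothetical helper: slurp a whole file using a lexical filehandle.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;    # undefine the record separator so <$fh> reads everything
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Demo on an in-memory sample. For a real file, one would call
# extract_strings(slurp_file('script.pl')) instead.
my $sample = qq{print "This is a\nmultiline\nstring";\nprint 'single quote string';\n};
my @strings = extract_strings($sample);
print "<$_>\n" for @strings;
```

Using a lexical filehandle with the three-argument form of open avoids the precedence trap in `open FILE, $file or die ...`, where `or` would need to be applied to the result of `open` rather than to `$file`.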
Update about (?> ... )
The following is an annotated version of the double quote regular expression.
my $dquo_re = qr{
"
(?: # Non-capturing group - http://perldoc.perl.org/perlretut.html#Non-capturing-groupings
(?> # Independent Subexpression to prevent backtracking (this is for efficiency only) - http://perldoc.perl.org/perlretut.html#Using-independent-subexpressions-to-prevent-backtracking
[^"\\]* # All characters NOT a " or \
)
|
\\. # Backslash followed by any escaped character
)* # Any number of the preceding or'd group
"
}x;
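As a small illustration of the `\\.` branch, the sketch below runs the double quote regex over a string that contains escaped quotes. The sample text is made up for this example.

```perl
use strict;
use warnings;

my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

# Sample text (invented for this demo): one string with escaped quotes,
# followed by a plain string. q{} keeps the backslashes literal.
my $text = q{print "she said \"hi\" to me", "plain";};

# In list context with /g, the match returns every captured string.
my @matches = $text =~ /($dquo_re)/g;
print "<$_>\n" for @matches;
```

This should print two matches: the first string in full, including its escaped quotes, and then `"plain"`. The escaped quotes do not end the first string because each backslash and the character after it are consumed together.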
The independent subexpression (?> ... ) is not actually required for this regex to work. It is intended to prevent backtracking, because there is only one way for a quoted string to match: either we find an ending quote using the above rules or we don't.
The subexpression is a lot more useful when dealing with a recursive regex, but I've always used it in this case. I'll have to benchmark it at a later date to decide whether it's actually just a premature optimization.
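One possible way to run such a benchmark is with the core Benchmark module. The sketch below is an assumption about how the comparison could be set up, not a measurement from the original answer: the names `$atomic_re`, `$plain_re` and `count_matches`, the generated sample data, and the iteration count are all made up for illustration. It only covers properly terminated strings; unterminated strings, where backtracking is likely to matter more, are not measured.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The same pattern with and without the independent subexpression.
my $atomic_re = qr{"(?:(?>[^"\\]*)|\\.)*"};
my $plain_re  = qr{"(?:[^"\\]*|\\.)*"};

# Generated sample input: 200 lines, each with one terminated string
# that contains escaped quotes.
my $data = join '', map { qq{print "line $_ with \\"escapes\\" and text";\n} } 1 .. 200;

# Count the matches so both variants can be checked for identical results.
sub count_matches {
    my ($re) = @_;
    my $count = 0;
    $count++ while $data =~ /$re/g;
    return $count;
}

my $atomic_count = count_matches($atomic_re);
my $plain_count  = count_matches($plain_re);
print "atomic: $atomic_count matches, plain: $plain_count matches\n";

# Run each variant 2000 times (an arbitrary count) and print a comparison table.
cmpthese(2000, {
    atomic => sub { count_matches($atomic_re) },
    plain  => sub { count_matches($plain_re) },
});
```

If Benchmark warns about too few iterations for a reliable count, raise the count or pass a negative number such as -1 to run each variant for at least that many CPU seconds.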
Update about comments
To avoid matching inside comments, you can just use the PPI solution that I already proposed. It's meant to parse Perl code and will already work as it is.
However, given that this is a lab assignment, a regex solution would be to set up a second capturing group in your loop for finding comments:
# $comment_re is not defined here; you have to supply your own pattern.
while ($data =~ /($squo_re|$dquo_re)|($comment_re)/g) {
    my $quote   = $1;
    my $comment = $2;
    if (defined $quote) {
        print "<$quote>\n";
    } elsif (defined $comment) {
        print "Comment - $comment\n";
    }
}
The above will match either a quoted string or a comment. Only the capture group that actually matched will be defined, so you can tell which one was found. You will have to come up with the regular expression for finding a comment on your own, though.
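For illustration only, here is a runnable sketch of that loop with a deliberately naive comment pattern. The definition of `$comment_re` is an assumption, not the answer's: it treats everything from a `#` to the end of the line as a comment, which will misfire on constructs such as `$#array` or a `#` inside a regex literal. The sample data is also made up.

```perl
use strict;
use warnings;

my $squo_re = qr{'(?:(?>[^'\\]*)|\\.)*'};
my $dquo_re = qr{"(?:(?>[^"\\]*)|\\.)*"};

# ASSUMPTION: a comment is a '#' and everything up to the end of the line.
# This is naive and is only meant to make the loop runnable.
my $comment_re = qr{#[^\n]*};

my $data = <<'END';
print "not a # comment"; # it's a real comment
print 'single quote string';
END

my (@quotes, @comments);
while ($data =~ /($squo_re|$dquo_re)|($comment_re)/g) {
    my ($quote, $comment) = ($1, $2);
    if (defined $quote) {
        push @quotes, $quote;
        print "<$quote>\n";
    } elsif (defined $comment) {
        push @comments, $comment;
        print "Comment - $comment\n";
    }
}
```

Because both alternatives are tried in one pass, a `#` inside a quoted string is consumed as part of the string, and an apostrophe inside a comment is consumed as part of the comment, so neither is misread as the start of the other.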