Perl multiline regex to separate comments within paragraphs


Problem description

The script below works, but it requires a kludge. By "kludge" I mean a line of code that makes the script do what I want, although I do not understand why the line is necessary. Evidently, I do not understand exactly what the multiline regex substitution, ending in /mg, is doing.

Is there not a more elegant way to accomplish the task?

The script reads through a file by paragraphs. It partitions each paragraph into two subsets: $text and $cmnt. The $text includes the left part of every line, i.e., from the first column up to the first %, if there is one, or to the end of the line if there is not. The $cmnt includes the rest.

Motivation: The files to be read are LaTeX markup, where % announces the beginning of a comment. We could change the value of $breaker to # if we were reading through a Perl script. After separating $text from $cmnt, one could perform a match across lines such as

print "match" if ($text =~ /WOLF\s*DOG/s);

Please see the line labeled "kludge." Without that line, something funny happens after the last % in a record. If there are lines of $text (material not commented out by %) after the last commented line of the record, those lines are included both at the end of $cmnt and in $text.

In the example below, this means that without the kludge, in record 2, "cat lion" is included both in $text, where it belongs, and also in $cmnt.

(The kludge causes an unnecessary % to appear at the end of every non-void $cmnt. This is because the % pasted on by the kludge announces a final, fictitious empty comment line.)

According to https://perldoc.perl.org/perlre.html#Modifiers, the /m regex modifier means

Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.

Therefore, I would expect the 2nd match in

s/^([^$breaker]*)($breaker.*?)$/$2/mg

to start with the first %, to extend as far as end-of-line, and to stop there. So even without the kludge, it should not include the "cat lion" in record 2. But obviously it does, so I am misreading, or missing, some part of the documentation. I suspect it has to do with the /g regex modifier?

#!/usr/bin/perl
use strict; use warnings;
my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
    $count_record++; 
    my $text = $_; 
    my $cmnt;
    s/[\n]*\z/$breaker/; # kludge
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)  # non-greedy
    {
        $cmnt    = $_; 
        die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);  # non-greedy
    }
    else
    {
        $cmnt    = ''; 
    }
    print "\nRECORD $count_record:\n";
    print "******** text==";
    print "\n|";
    print $text;
    print "|\n";
    print "******** cmnt==|";
    print $cmnt;
    print "|\n";
}

A sample file to run it on:

dog wolf % flea 
DOG WOLF % FLEA 
DOG WOLLLLLLF % FLLLLLLEA 


% what was that?
 cat lion


no comments in this line




%The last paragraph of this file is nothing but a single-line comment.

Recommended answer

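The substitution rewrites only the parts of the string that match, and a line without a $breaker never matches, so such lines stay in $cmnt unchanged. (The kludge hid this because [^$breaker] also matches newlines, which let the match run across those lines up to the appended $breaker.) You must therefore also delete the lines that do not contain a comment from $cmnt: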

use feature qw(say);
use strict;
use warnings;

my $count_record = 0;
my $breaker = '%';
$/ = ''; # one paragraph at a time
while(<>)
{
    $count_record++;
    my $text = $_;
    my $cmnt;
    s/[\n]*\z/\n/; # guarantee each record ends with exactly one newline==LF==linefeed
    if ($text =~ s/^([^$breaker]*)($breaker.*?)$/$1/mg)  # non-greedy
    {
        $cmnt    = $_;
        $cmnt =~ s/^[^$breaker]*?$//mg;
        die "cmnt does not match" unless ($cmnt =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg);  # non-greedy
    }
    else
    {
        $cmnt    = '';
    }
    print "\nRECORD $count_record:\n";
    print "******** text==";
    print "\n|";
    print $text;
    print "|\n";
    print "******** cmnt==|";
    print $cmnt;
    print "|\n";
}

Output:

RECORD 1:
******** text==
|dog wolf 
DOG WOLF 
DOG WOLLLLLLF 

|
******** cmnt==|% flea 
% FLEA 
% FLLLLLLEA 
|

RECORD 2:
******** text==
|
 cat lion

|
******** cmnt==|% what was that?

|

RECORD 3:
******** text==
|no comments in this line

|
******** cmnt==||

RECORD 4:
******** text==
||
******** cmnt==|%The last paragraph of this file is nothing but a single-line comment.
|
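
Why the extra substitution is needed can be seen on record 2 alone. The sketch below is an illustration, not part of the original answer; the variable names $record, $without and $with are made up here. It relies on two facts about Perl regexes: s///g rewrites only the portions of the string that match, and a negated class such as [^%] also matches newlines.

```perl
use strict;
use warnings;

my $breaker = '%';
# Record 2 of the sample file, normalized to end in exactly one newline.
my $record  = "% what was that?\n cat lion\n";

# Without the extra step: the " cat lion" line contains no % at all, so the
# pattern cannot match there and s///mg leaves that line untouched.
(my $without = $record) =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg;

# With the answer's extra step: blank out comment-free lines first,
# then keep only the comment part of the remaining lines.
my $with = $record;
$with =~ s/^[^$breaker]*?$//mg;
$with =~ s/^([^$breaker]*)($breaker.*?)$/$2/mg;

print "without: |$without|\n";
print "with:    |$with|\n";
```

Run as written, $without still contains " cat lion", while $with holds only the comment line followed by an empty line, which corresponds to the blank line in the cmnt field of RECORD 2 above.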

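As for a more elegant way, one possibility (offered as a sketch under assumptions, not as part of the recommended answer) is to split each line at its first $breaker instead of running multiline substitutions over the whole paragraph. The helper name partition_record is hypothetical. Unlike the code above, it drops comment-free lines from the comment part entirely instead of leaving blank lines behind, and it always ends each part's lines with a single newline.

```perl
use strict;
use warnings;

# Hypothetical helper: split one paragraph into ($text, $cmnt), line by line.
sub partition_record {
    my ($record, $breaker) = @_;
    my (@text, @cmnt);
    for my $line (split /\n/, $record) {
        my $pos = index($line, $breaker);   # -1 when the line has no comment
        if ($pos < 0) {
            push @text, $line;                   # whole line is text
        }
        else {
            push @text, substr($line, 0, $pos);  # left of the first breaker
            push @cmnt, substr($line, $pos);     # breaker and everything after
        }
    }
    # Re-attach one newline per line so cross-line matches still work.
    return (join('', map { "$_\n" } @text), join('', map { "$_\n" } @cmnt));
}

my ($text, $cmnt) = partition_record("% what was that?\n cat lion\n", '%');
print "text==|$text|\ncmnt==|$cmnt|\n";
```

Because index() does a plain substring search, this version needs no regex escaping and would also accept a $breaker longer than one character; a match across lines such as /WOLF\s*DOG/ can still be applied to the returned text.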