Differences between empty and blank lines in regexps
Question
There are already several good discussions of regular expressions and empty lines on SO. I'll remove this question if it is a duplicate.
Can anyone explain why this script outputs 5 3 4 5 4 3 instead of 4 3 4 4 4 3? When I run it in the debugger, $blank and $classyblank stay at "4" (which I assume is the correct value) until just before the print statement.
my ( $blank, $nonblank, $non_nonblank,
     $classyblank, $classyspace, $blanketyblank ) = (0) x 6 ;  # initialize every counter, not only the first
while (<DATA>) {
    $blank++         if /\p{IsBlank}/ ;      # POSIXly blank - 4?
    $nonblank++      if /^\P{IsBlank}$/ ;    # POSIXly non-blank - 3
    $non_nonblank++  if not /\S/ ;           # perlishly not non-blank - 4
    $classyblank++   if /[[:blank:]]/ ;      # older(?) charclass blankness - 4?
    $classyspace++   if /^[[:space:]]$/ ;    # older(?) charclass whitespace - 4
    $blanketyblank++ if /^$/ ;               # perlishly *really empty* - 3
}
print join " ", $blank, $nonblank, $non_nonblank,
                $classyblank, $classyspace, $blanketyblank , "\n" ;
__DATA__
line above only has a linefeed this one is not blank because: words
this line is followed by a line with white space (you may need to add it)
then another blank line following this one
THE END :-\
Is it something to do with the __DATA__ section, or am I misunderstanding POSIX regular expressions?
PS: As noted in a comment on a timely post elsewhere, "really empty" (/^$/) can miss non-emptiness:
perl -E 'my $string = "\n" . "foo\n\n" ; say "empty" if $string =~ /^$/ ;'
perl -E 'my $string = "\n" . "bar\n\n" ; say "empty" if $string =~ /\A\z/ ;'
perl -E 'my $string = "\n" . "baz\n\n" ; say "empty" if $string =~ /\S/ ;'
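To make the distinction concrete, here is a minimal sketch (the helper name classify is hypothetical, not from the original post) that runs a few strings through the three tests. Without the /m modifier, $ can also match just before a trailing newline, so /^$/ accepts both "" and "\n", whereas /\A\z/ accepts only a zero-length string.

```perl
use strict;
use warnings;

# Hypothetical helper: report which "emptiness" tests a string passes.
sub classify {
    my ($string) = @_;
    return {
        caret_dollar => ( $string =~ /^$/   ? 1 : 0 ),  # "" or a lone "\n"
        strict_empty => ( $string =~ /\A\z/ ? 1 : 0 ),  # zero-length string only
        no_nonspace  => ( $string !~ /\S/   ? 1 : 0 ),  # nothing but whitespace (or nothing at all)
    };
}

for my $s ( "", "\n", " \n", "foo\n" ) {
    my $r = classify($s);
    ( my $shown = $s ) =~ s/\n/\\n/g;    # make newlines visible in the report
    printf "%-8s caret_dollar=%d strict_empty=%d no_nonspace=%d\n",
        qq{"$shown"}, @{$r}{qw(caret_dollar strict_empty no_nonspace)};
}
```

Only the first string passes all three tests; "\n" passes /^$/ and the "no non-space" test but fails /\A\z/.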
Recommended answer
/\p{IsBlank}/ doesn't check for an empty string. \p matches a single character that has the specified Unicode property, so the unanchored pattern succeeds on any line that contains at least one such character anywhere in it.
$ unichars '\p{IsBlank}' | cat
---- U+0009 CHARACTER TABULATION
---- U+0020 SPACE
---- U+00A0 NO-BREAK SPACE
---- U+1680 OGHAM SPACE MARK
---- U+2000 EN QUAD
---- U+2001 EM QUAD
---- U+2002 EN SPACE
---- U+2003 EM SPACE
---- U+2004 THREE-PER-EM SPACE
---- U+2005 FOUR-PER-EM SPACE
---- U+2006 SIX-PER-EM SPACE
---- U+2007 FIGURE SPACE
---- U+2008 PUNCTUATION SPACE
---- U+2009 THIN SPACE
---- U+200A HAIR SPACE
---- U+202F NARROW NO-BREAK SPACE
---- U+205F MEDIUM MATHEMATICAL SPACE
---- U+3000 IDEOGRAPHIC SPACE
It matches " \n" since SPACE has the IsBlank property.
/[[:blank:]]/ doesn't check for an empty string either. [...] matches a single character that is a member of the specified class.
$ unichars '[[:blank:]]' | cat
---- U+0009 CHARACTER TABULATION
---- U+0020 SPACE
---- U+00A0 NO-BREAK SPACE
---- U+1680 OGHAM SPACE MARK
---- U+2000 EN QUAD
---- U+2001 EM QUAD
---- U+2002 EN SPACE
---- U+2003 EM SPACE
---- U+2004 THREE-PER-EM SPACE
---- U+2005 FOUR-PER-EM SPACE
---- U+2006 SIX-PER-EM SPACE
---- U+2007 FIGURE SPACE
---- U+2008 PUNCTUATION SPACE
---- U+2009 THIN SPACE
---- U+200A HAIR SPACE
---- U+202F NARROW NO-BREAK SPACE
---- U+205F MEDIUM MATHEMATICAL SPACE
---- U+3000 IDEOGRAPHIC SPACE
It matches " \n" since SPACE is a member of the [:blank:] POSIX character class and thus a member of the [[:blank:]] character class.
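The practical consequence is that both patterns also succeed on ordinary lines of text, because those contain spaces between words. A minimal sketch, assuming an input with three empty lines, one line holding a single space, and four lines of text (the sample data and the helper name count_lines are illustrative assumptions, not the original __DATA__ section):

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines in @$lines that match $regex.
sub count_lines {
    my ( $regex, $lines ) = @_;
    return scalar grep { $_ =~ $regex } @$lines;
}

# Illustrative data: 3 empty lines, 1 single-space line, 4 lines of text.
my @lines = (
    "\n",
    "this line has words\n",
    "\n",
    "followed by a whitespace-only line\n",
    " \n",
    "then another empty line\n",
    "\n",
    "THE END\n",
);

# Unanchored: true for any line with a blank character somewhere in it.
printf "%-20s %d\n", '/\p{IsBlank}/',    count_lines( qr/\p{IsBlank}/,    \@lines );  # 5
printf "%-20s %d\n", '/[[:blank:]]/',    count_lines( qr/[[:blank:]]/,    \@lines );  # 5

# Anchored: true only when everything before the newline is blank (or absent).
printf "%-20s %d\n", '/^\p{IsBlank}*$/', count_lines( qr/^\p{IsBlank}*$/, \@lines );  # 4

# Anchored, at least one blank: whitespace-only but not empty.
printf "%-20s %d\n", '/^\p{IsBlank}+$/', count_lines( qr/^\p{IsBlank}+$/, \@lines );  # 1
```

The unanchored patterns report 5 (four text lines plus the single-space line), which is the count the question observed; anchoring with ^ and $ and adding a quantifier gives the 4 the question expected.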