How to emulate a word boundary when using Unicode character properties?
Question
From my previous questions, "Why under locale-pragma word characters do not match?" and "How to change nested quotes", I learned that when dealing with UTF-8 data you can't trust \w as a word character and must use the Unicode character property \p{Word} instead. Now I have found that the zero-width word boundary \b also does not work with UTF-8 (with locale enabled), but I could not find an equivalent among the Unicode character properties. I thought I could construct it myself like this: (?<=\P{Word})(\p{Word}+)(?=\P{Word}), which should be equivalent to \b(\w+)\b.
In the test script below I have two arrays to test two different regexes. The first, based on \b, works fine when locale is not enabled. To make it work with locales as well, I wrote another version that emulates the boundary with (?=\P{Word}), but it does not work as I expected (the expected results are shown in the script too).
Do you see what is wrong, and how can I get the emulated regex to work the way the first one does with ASCII (or without locale)?
#!/usr/bin/perl
use 5.010;
use utf8::all;
use locale; # et_EE.UTF-8 in my case
$| = 1;
my @test_boundary = ( # EXPECTED RESULT:
'"abc def"', # '«abc def»'
'"abc "d e f" ghi"', # '«abc «d e f» ghi»'
'"abc "d e f""', # '«abc «d e f»»'
'"abc "d e f"', # '«abc "d e f»'
'"abc "d" "e" f"', # '«abc «d» «e» f»'
# below won't work with \b when locale enabled
'"100 Естонiï"', # '«100 Естонiï»'
'"äöõ "ä õ ü" ï"', # '«äöõ «ä õ ü» ï»'
'"äöõ "ä õ ü""', # '«äöõ «ä õ ü»»'
'"äöõ "ä õ ü"', # '«äöõ "ä õ ü»'
'"äöõ "ä" "õ" ï"', # '«äöõ «ä» «õ» ï»'
);
my @test_emulate = ( # EXPECTED RESULT:
'"100 Естонiï"', # '«100 Естонiï»'
'"äöõ "ä õ ü" ï"', # '«äöõ «ä õ ü» ï»'
'"äöõ "ä õ ü""', # '«äöõ «ä õ ü»»'
'"äöõ "ä õ ü"', # '«äöõ "ä õ ü»'
'"äöõ "ä" "õ" ï"', # '«äöõ «ä» «õ» ï»'
);
say "BOUNDARY";
for my $sentence ( @test_boundary ) {
    my $quote_count = ( $sentence =~ tr/"/"/ );
    for ( my $i = 0 ; $i <= $quote_count ; $i += 2 ) {
        $sentence =~ s/
            "(                  # first quote, start capture
            [\p{Word}\.]+?      # at least one word-char or point
            .*?\b[\.,?!»]*?     # any char followed by boundary + opt. punctuation
            )"                  # stop capture, ending quote
        /«$1»/xg;               # change to fancy
    }
    say $sentence;
}
say "EMULATE";
for my $sentence ( @test_emulate ) {
    my $quote_count = ( $sentence =~ tr/"/"/ );
    for ( my $i = 0 ; $i <= $quote_count ; $i += 2 ) {
        $sentence =~ s/
            "(                  # first quote, start capture
            [\p{Word}\.]+?      # at least one word-char or point
            .*?(?=\P{Word})     # any char followed by boundary
            [\.,?!»]*?          # optional punctuation
            )"                  # stop capture, ending quote
        /«$1»/gx;               # change to fancy
    }
    say $sentence;
}
Recommended answer
Since the character after the position of the \b is either some punctuation or " (to be safe, double-check that \p{Word} does not match any of them), this falls into the case \b\W. Therefore, we can emulate \b with:
(?<=\p{Word})
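As a rough sketch, this is how the look-behind could be dropped into the substitution from the question. The sub name fancy_quotes is made up for illustration, and the result relies on the assumption that \p{Word} matches neither the quote nor the punctuation characters:

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: turn straight double quotes into guillemets,
# using (?<=\p{Word}) where the original pattern had \b.
sub fancy_quotes {
    my ($sentence) = @_;
    my $quote_count = ( $sentence =~ tr/"// );    # count the straight quotes
    for ( my $i = 0 ; $i <= $quote_count ; $i += 2 ) {
        $sentence =~ s{
            "(                  # opening quote, start capture
            [\p{Word}.]+?       # at least one word character or period
            .*?(?<=\p{Word})    # anything, ending right after a word character
            [.,?!»]*?           # optional punctuation
            )"                  # stop capture, closing quote
        }{«$1»}xg;
    }
    return $sentence;
}

binmode STDOUT, ':encoding(UTF-8)';
print fancy_quotes('"äöõ "ä õ ü" ï"'), "\n";
print fancy_quotes('"100 Естонiï"'), "\n";
```

Because the look-behind only inspects the character before the closing quote, it never requires a character after it, so a quote at the very end of the string still matches.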
I am not familiar with Perl, but from what I tested here, it seems that \w (and \b) also work nicely when the encoding is set to UTF-8.
$sentence =~ s/
"(
[\w\.]+?
.*?\b[\.,?!»]*?
)"
/«$1»/xg;
If you move up to Perl 5.14 or above, you can set the character set to Unicode with the u flag.
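A small sketch of what the flag does, assuming Perl 5.14 or later. The /a flag is shown only for contrast and is not part of the original answer:

```perl
use 5.014;
use warnings;

my $word = "\x{E4}\x{F6}\x{F5}";    # "äöõ" written with escapes

# /u forces Unicode rules, so \w and \b treat these letters as word characters.
my $unicode_match = $word =~ /\b\w+\b/u ? 1 : 0;

# /a restricts \w to ASCII, so the same string contains no word characters.
my $ascii_match = $word =~ /\w/a ? 1 : 0;

say "with /u: $unicode_match";
say "with /a: $ascii_match";
```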
You can use this general strategy to construct a boundary corresponding to any character class (just as the definition of the \b word boundary is based on the definition of \w).
Let C be the character class. We would like to define a boundary that is based on the character class C.
The construction below emulates a boundary in front, when you know the current character belongs to the character class C (equivalent to \b\w):
(?<!C)C
Or behind (equivalent to \w\b):
C(?!C)
Why negative look-around? Because a positive look-around (with the complementary character class) would also assert that there must be a character ahead/behind (that is, at least one character of width ahead/behind). Negative look-around handles the beginning/end of the string without writing a cumbersome regex.
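The difference can be seen in a minimal sketch with C taken to be \p{Word}; the sample string is an arbitrary choice for illustration:

```perl
use strict;
use warnings;
use utf8;

my $text = "äöõ ü";

# Positive look-around with the complement class: it needs a real non-word
# character on both sides, so words touching the string edges are missed.
my @positive = $text =~ /(?<=\P{Word})(\p{Word}+)(?=\P{Word})/g;

# Negative look-around: "no word character here", which also holds at the
# very start and the very end of the string.
my @negative = $text =~ /(?<!\p{Word})(\p{Word}+)(?!\p{Word})/g;

print scalar(@positive), " match(es) with positive look-around\n";
print scalar(@negative), " match(es) with negative look-around\n";
```

This is the same reason the (?<=\P{Word})...(?=\P{Word}) construction from the question cannot match a word at the start or end of the input.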
For the emulation of \B\w:
(?<=C)C
And similarly for \w\B:
C(?=C)
\B is the direct opposite of \b; therefore, we can just flip the positive/negative look-around to emulate the effect. It also makes sense: a non-boundary can only be formed when there are more characters ahead/behind.
Other emulations (let c be the complement character class of C):
- \b\W: (?<=C)c
- \W\b: c(?=C)
- \B\W: (?<!C)c
- \W\B: c(?!C)
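As an illustration of the first two entries, here is a short sketch with C = \p{Word} and c = \P{Word}; the sample text is made up:

```perl
use strict;
use warnings;
use utf8;

my $text = "õun, pirn!  ja";

# \b\W emulated as (?<=C)c: a non-word character directly after a word character.
my @after_word = $text =~ /(?<=\p{Word})\P{Word}/g;

# \W\b emulated as c(?=C): a non-word character directly before a word character.
my @before_word = $text =~ /\P{Word}(?=\p{Word})/g;

print "after a word:  ", join( "|", @after_word ),  "\n";
print "before a word: ", join( "|", @before_word ), "\n";
```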
For the emulation of a standalone boundary (equivalent to \b):
(?:(?<!C)(?=C)|(?<=C)(?!C))
And for a standalone non-boundary (equivalent to \B):
(?:(?<!C)(?!C)|(?<=C)(?=C))
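One way to sanity-check the two standalone constructions is to take C to be \w and compare the match offsets with those of the built-in \b and \B. The helper name positions and the sample string are assumptions made for this sketch:

```perl
use strict;
use warnings;

# C is taken to be \w here so the emulations can be checked against the built-ins.
my $b_emulated = qr/(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))/;
my $B_emulated = qr/(?:(?<!\w)(?!\w)|(?<=\w)(?=\w))/;

# Hypothetical helper: offsets at which a zero-width pattern matches.
sub positions {
    my ( $re, $str ) = @_;
    my @offsets;
    push @offsets, $-[0] while $str =~ /$re/g;
    return "@offsets";
}

my $text = "ab, cd!";
print "\\b built-in: ", positions( qr/\b/,      $text ), "\n";
print "\\b emulated: ", positions( $b_emulated, $text ), "\n";
print "\\B built-in: ", positions( qr/\B/,      $text ), "\n";
print "\\B emulated: ", positions( $B_emulated, $text ), "\n";
```

To get a Unicode-aware boundary, the same patterns can be written with \p{Word} in place of \w.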