How to emulate a word boundary when using Unicode character properties?

Question

From my previous questions, "Why under locale-pragma word characters do not match?" and "How to change nested quotes", I learned that when dealing with UTF-8 data you can't trust \w to match word characters and must use the Unicode character property \p{Word} instead. Now I have found that the zero-width word boundary \b also does not work with UTF-8 (with locale enabled), and I could not find an equivalent among the Unicode character properties. I thought I could construct it myself as (?<=\P{Word})(\p{Word}+)(?=\P{Word}), which should be equivalent to \b(\w+)\b.

In the test script below I have two arrays to test two different regexes. The first, based on \b, works fine when locale is not enabled. To make it work with locales as well, I wrote a second version that emulates the boundary with (?=\P{Word}), but it does not work as I expected (the expected results are shown in the script too).

Do you see what is wrong, and how can I get the emulated regex to work the way the first one does with ASCII (or without locale)?

#!/usr/bin/perl

use 5.010;
use utf8::all;
use locale; # et_EE.UTF-8 in my case
$| = 1;

my @test_boundary = (  # EXPECTED RESULT:
  '"abc def"',         # '«abc def»'
  '"abc "d e f" ghi"', # '«abc «d e f» ghi»'
  '"abc "d e f""',     # '«abc «d e f»»'
  '"abc "d e f"',      # '«abc "d e f»'
  '"abc "d" "e" f"',   # '«abc «d» «e» f»'
  # below won't work with \b when locale enabled
  '"100 Естонiï"',     #  '«100 Естонiï»'
  '"äöõ "ä õ ü" ï"',   # '«äöõ «ä õ ü» ï»'
  '"äöõ "ä õ ü""',     # '«äöõ «ä õ ü»»'
  '"äöõ "ä õ ü"',      # '«äöõ "ä õ ü»'
  '"äöõ "ä" "õ" ï"',   # '«äöõ «ä» «õ» ï»'
);

my @test_emulate = (   # EXPECTED RESULT:
  '"100 Естонiï"',     # '«100 Естонiï»'
  '"äöõ "ä õ ü" ï"',   # '«äöõ «ä õ ü» ï»'
  '"äöõ "ä õ ü""',     # '«äöõ «ä õ ü»»'
  '"äöõ "ä õ ü"',      # '«äöõ "ä õ ü»'
  '"äöõ "ä" "õ" ï"',   # '«äöõ «ä» «õ» ï»'
);

say "BOUNDARY";
for my $sentence ( @test_boundary ) {
  my $quote_count = ( $sentence =~ tr/"/"/ );

  for ( my $i = 0 ; $i <= $quote_count ; $i += 2 ) {
    $sentence =~ s/
      "(                          # first quote, start capture
        [\p{Word}\.]+?            # at least one word-char or point
        .*?\b[\.,?!»]*?           # any chars up to a boundary, then optional punctuation
      )"                          # stop capture, ending quote
      /«$1»/xg;                   # change to fancy
  }
  say $sentence;
}

say "EMULATE";
for my $sentence ( @test_emulate ) {
  my $quote_count =  ( $sentence =~ tr/"/"/ );

  for ( my $i = 0 ; $i <= $quote_count ; $i += 2 ) {
    $sentence =~ s/
      "(                         # first quote, start capture
      [\p{Word}\.]+?             # at least one word-char or point
      .*?(?=\P{Word})            # any chars, then the emulated boundary
      [\.,?!»]*?                 # optional punctuation
      )"                         # stop capture, ending quote
      /«$1»/gx;                  # change to fancy
  }
  say $sentence;
}

Recommended answer

Since the character after the position of the \b is either a punctuation character or " (to be safe, double-check that \p{Word} does not match any of them), this is the case \b\W. The following character is already known to be a non-word character, so only the preceding side needs to be checked, and \b can be emulated with:

(?<=\p{Word})
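
As a rough sketch, this is how the look-behind could be dropped into the substitution from the question in place of \b. The wrapper name fancy_quotes is made up for this illustration, and the script assumes the source file is saved as UTF-8 and that the strings are decoded characters (no `use locale`).

```perl
use strict;
use warnings;
use utf8;                                  # literals below are decoded characters
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical wrapper around the loop from the question's script.
sub fancy_quotes {
    my ($sentence) = @_;
    my $quote_count = ( $sentence =~ tr/"// );   # count straight quotes

    for ( my $i = 0 ; $i <= $quote_count ; $i += 2 ) {
        $sentence =~ s/
          "(                       # opening quote, start capture
            [\p{Word}.]+?          # at least one word character or period
            .*?(?<=\p{Word})       # lazily up to a spot preceded by a word character
            [.,?!»]*?              # optional trailing punctuation
          )"                       # end capture, closing quote
          /«$1»/xg;
    }
    return $sentence;
}

print fancy_quotes('"100 Естонiï"'), "\n";       # «100 Естонiï»
print fancy_quotes('"abc "d e f" ghi"'), "\n";   # «abc «d e f» ghi»
```

Only the look-behind is needed here because the characters that can follow the boundary (the punctuation class or the closing quote) are spelled out explicitly right after it.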

I am not familiar with Perl, but from what I tested here, it seems that \w (and \b) also work nicely when the encoding is set to UTF-8.

$sentence =~ s/
  "(
    [\w\.]+?
    .*?\b[\.,?!»]*?
  )"
  /«$1»/xg;

If you move to Perl 5.14 or later, you can select Unicode character-set semantics with the /u flag.
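
A small sketch of what that could look like, assuming Perl 5.14 or later, a source file saved as UTF-8, and no `use locale` in scope. The helper name words_u is invented for this example.

```perl
use strict;
use warnings;
use utf8;                                  # literals below are decoded characters
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: collect words using \b and \w under the u modifier,
# which forces Unicode rules for \w, \b and friends.
sub words_u {
    my ($text) = @_;
    return $text =~ /\b(\w+)\b/gu;
}

print join('|', words_u('äöõ ü-ï')), "\n";   # äöõ|ü|ï
```

The important precondition is that the input has been decoded into characters; on raw UTF-8 bytes the /u flag cannot help, since each byte is then treated as its own character.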

You can use the following general strategy to construct a boundary corresponding to any character class, in the same way that the definition of the word boundary \b is based on the definition of \w.

Let C be the character class. We would like to define a boundary that is based on the character class C.

The construction below emulates a boundary in front when you know the current character belongs to the character class C (equivalent to \b\w):

(?<!C)C

Or behind (equivalent to \w\b):

C(?!C)

Why negative look-around? Because a positive look-around (with the complementary character class) also asserts that there must be a character ahead/behind (at least one character of width ahead/behind). A negative look-around covers the beginning and end of the string without writing a cumbersome regex.
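
To make the difference concrete, here is a minimal sketch with C = \p{Word} that compares the negative look-around form against the positive look-around attempt from the question. The helper names words_negative and words_positive are made up for this illustration, and the source file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                                  # literals below are decoded characters
binmode STDOUT, ':encoding(UTF-8)';

# Boundary built from negative look-arounds:
# it also holds at the very start and very end of the string.
sub words_negative {
    my ($text) = @_;
    return $text =~ /(?<!\p{Word})(\p{Word}+)(?!\p{Word})/g;
}

# Attempt from the question: positive look-arounds on the complement class.
# Each side demands a real non-word character, so a word that touches
# either end of the string is skipped.
sub words_positive {
    my ($text) = @_;
    return $text =~ /(?<=\P{Word})(\p{Word}+)(?=\P{Word})/g;
}

my $text = 'äöõ Естонiï ü';
print join('|', words_negative($text)), "\n";   # äöõ|Естонiï|ü
print join('|', words_positive($text)), "\n";   # Естонiï
```

The second call loses the first and last word because there is no character before "äöõ" or after "ü" for the positive look-arounds to match.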

For the emulation of \B\w:

(?<=C)C

And similarly for \w\B:

C(?=C)

\B is the direct opposite of \b, so we can simply flip the positive/negative look-around to emulate the effect. This also makes sense: a non-boundary next to a character of C can only be formed when there is another character of C ahead/behind.

Other emulations (let c be the complement character class of C):

  • \b\W: (?<=C)c
  • \W\b: c(?=C)
  • \B\W: (?<!C)c
  • \W\B: c(?!C)

For the emulation of a standalone boundary (equivalent to \b):

(?:(?<!C)(?=C)|(?<=C)(?!C))

And standalone non-boundary (equivalent to \B):

(?:(?<!C)(?!C)|(?<=C)(?=C))
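
As a sketch, the two standalone constructions can be precompiled with qr// for C = \p{Word} and checked against the built-in \b. The names $BOUNDARY, $NON_BOUNDARY and offsets are invented for this example; the comparison with \b under the u modifier assumes Perl 5.14 or later and a source file saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                                  # literals below are decoded characters

# C = \p{Word}: $BOUNDARY emulates \b, $NON_BOUNDARY emulates \B.
my $BOUNDARY     = qr/(?:(?<!\p{Word})(?=\p{Word})|(?<=\p{Word})(?!\p{Word}))/;
my $NON_BOUNDARY = qr/(?:(?<!\p{Word})(?!\p{Word})|(?<=\p{Word})(?=\p{Word}))/;

# Hypothetical helper: character offsets at which a zero-width pattern matches.
sub offsets {
    my ($text, $re) = @_;
    my @found;
    while ( $text =~ /$re/g ) {
        push @found, pos($text);
    }
    return @found;
}

my $text = 'äö ü';
print join(',', offsets($text, $BOUNDARY)), "\n";       # 0,2,3,4
print join(',', offsets($text, qr/\b/u)), "\n";         # 0,2,3,4
print join(',', offsets($text, $NON_BOUNDARY)), "\n";   # 1
```

Swapping \p{Word} for another class in both qr// objects gives a boundary for that class, which is the point of the general construction.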
