Regular expression to match boundary between different Unicode scripts


Question

Regular expression engines have a concept of "zero-width" matches, some of which are useful for finding the edges of words:

  • \b - present in most engines; matches any boundary between word and non-word characters
  • \< and \> - present in Vim; match only the boundary at the beginning of a word and at the end of a word, respectively
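
To make the idea of zero-width matches concrete, here is a minimal Perl sketch that lists the positions at which \b matches in a short string. Perl has no \< or \>, so the sketch approximates them with \b plus a lookaround; that emulation and the variable names are assumptions made for illustration, not something stated in the question.

```perl
use strict;
use warnings;

my $text = "foo bar";
my (@any, @starts, @ends);

# \b alone: every boundary between a word and a non-word character.
push @any, pos($text) while $text =~ /\b/g;

# Rough stand-in for Vim's \< : a boundary followed by a word character.
push @starts, pos($text) while $text =~ /\b(?=\w)/g;

# Rough stand-in for Vim's \> : a boundary preceded by a word character.
push @ends, pos($text) while $text =~ /\b(?<=\w)/g;

print "any:    @any\n";     # any:    0 3 4 7
print "starts: @starts\n";  # starts: 0 4
print "ends:   @ends\n";    # ends:   3 7
```

Each match consumes no characters; only the position is reported.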

A newer concept in some regular expression engines is Unicode classes. One such class is the script, which can distinguish Latin, Greek, Cyrillic, etc. Depending on the syntax an engine accepts, these forms are all equivalent and match any character of the Greek writing system:

  • \p{greek}
  • \p{script=greek}
  • \p{script:greek}
  • [:script=greek:]
  • [:script:greek:]

But so far, in my reading through sources on regular expressions and Unicode, I haven't been able to determine whether there is any standard or nonstandard way to achieve a zero-width match where one script ends and another begins.

In the string παν語 there would be a match between the ν and 語 characters, just as \b and \< would match just before the π character.

Now for this example I could hack something together based on looking for \p{Greek} followed by \p{Han}, and I could even hack something together based on all possible combinations of two Unicode script names.

But this wouldn't be a robust solution, since new scripts are still being added to Unicode with each release. Is there a future-proof way to express this? Or is there a proposal to add it?
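
For reference, the pairwise hack mentioned in the question could be sketched in Perl as a lookbehind plus a lookahead, which gives a zero-width match exactly between a Greek and a Han character. This is only an illustration of the hard-coded approach (the choice of Perl and of split are assumptions); it covers one ordered pair of scripts and has the scaling problem described above.

```perl
use strict;
use warnings;
use utf8;                              # the source contains literal non-ASCII text
binmode(STDOUT, ':encoding(UTF-8)');

my $text = "παν語";

# Zero-width boundary: a Greek character behind, a Han character ahead.
my $greek_to_han = qr/(?<=\p{Script=Greek})(?=\p{Script=Han})/;

# Splitting on a zero-width pattern cuts the string at the boundary
# without consuming any characters.
my @parts = split $greek_to_han, $text;

print join('|', @parts), "\n";         # παν|語
```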

Recommended answer

I just noticed you didn’t actually specify which pattern-matching language you were using. Well, I hope a Perl solution will work for you, since the needed machinations are likely to be really tough in any other language. Plus, if you’re doing pattern matching with Unicode, Perl really is the best choice available for that particular kind of work.

When the $rx variable below is set to the appropriate pattern, this little snippet of Perl code:

my $data = "foo1 and Πππ 語語語 done";

while ($data =~ /($rx)/g) {
   print "Got string: '$1'\n"; 
} 

generates this output:

Got string: 'foo1 and '
Got string: 'Πππ '
Got string: '語語語 '
Got string: 'done'

That is, it pulls out a Latin string, a Greek string, a Han string, and another Latin string. This is pretty darned close to what I think you actually need.

The reason I didn’t post this yesterday is that I was getting weird core dumps. Now I know why.

My solution uses lexical variables inside of a (??{...}) construct. It turns out that this is unstable before v5.17.1, and at best worked only by accident. It fails on v5.17.0, but succeeds on v5.18.0 RC0 and RC2. So I’ve added a use v5.17.1 to make sure you’re running something recent enough to trust with this approach.

First, I decided that you didn’t actually want a run of all the same script type; you wanted a run of all the same script type plus Common and Inherited. Otherwise you will get messed up by punctuation, whitespace, and digits, which are Common, and by combining characters, which are Inherited. I really don’t think you want those to interrupt your run of "all the same script", but if you do, it’s easy to stop considering those.
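
To see why Common and Inherited need special treatment, here is a small sketch that prints the Script property of a few code points using charscript from Unicode::UCD (the same function the program in this answer imports). The particular code points are picked for illustration only.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charscript);

# Space, digit one, comma, combining acute accent, then one letter
# each from Latin, Greek, and Han.
for my $cp (0x0020, 0x0031, 0x002C, 0x0301, 0x0066, 0x03A0, 0x8A9E) {
    printf "U+%04X  %s\n", $cp, charscript($cp);
}
```

The first three come back as Common and the combining accent as Inherited, so a run defined strictly by a single script would be cut at every space, digit, or accent.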

So what we do is look ahead for the first character whose script type is something other than Common or Inherited. More than that, we extract from it what that script type actually is, and use this information to construct a new pattern that matches any number of characters whose script type is either Common, Inherited, or whatever script type we just found and saved off. Then we evaluate that new pattern and continue.
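
The same two steps can also be sketched as an ordinary loop, with no (??{...}) at all. To be clear, this is an alternative illustration and not the program from this answer: script_runs is a hypothetical helper name, and the fallback to 'Unknown' when charscript returns nothing is an assumption.

```perl
use strict;
use warnings;
use utf8;                              # the source contains literal non-ASCII text
use Unicode::UCD qw(charscript);
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: split a string into runs that each contain one
# script plus any Common and Inherited characters.
sub script_runs {
    my ($text) = @_;
    my @runs;
    while (length $text) {
        my $class = '\p{Script=Common}\p{Script=Inherited}';

        # Step 1: find the first character that is neither Common nor
        # Inherited, and add its script to the character class.
        if ($text =~ /([^\p{Script=Common}\p{Script=Inherited}])/) {
            my $script = charscript(ord $1) // 'Unknown';
            $class .= "\\p{Script=$script}";
        }

        # Step 2: strip the longest leading run made of that class.
        $text =~ s/\A([${class}]+)// or last;
        push @runs, $1;
    }
    return @runs;
}

print "Got string: '$_'\n" for script_runs("foo1 and Πππ 語語語 done");
```

Because the pattern is rebuilt in plain Perl code on each pass, this sketch does not depend on the v5.17.1 behaviour of (??{...}), at the cost of copying and shortening the string as it goes.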

Hey, I said it was hairy, didn’t I?

In the program I’m about to show, I’ve left in some commented-out debugging statements that show just what it’s doing. If you uncomment them, you get the output below for the sample run, which should help in understanding the approach:

DEBUG: Got peekahead character f, U+0066
DEBUG: Scriptname is Latin
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
Got string: 'foo1 and '
DEBUG: Got peekahead character Π, U+03a0
DEBUG: Scriptname is Greek
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Greek}]*}
Got string: 'Πππ '
DEBUG: Got peekahead character 語, U+8a9e
DEBUG: Scriptname is Han
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Han}]*}
Got string: '語語語 '
DEBUG: Got peekahead character d, U+0064
DEBUG: Scriptname is Latin
DEBUG: string to re-interpolate as regex is q{[\p{Script=Common}\p{Script=Inherited}\p{Script=Latin}]*}
Got string: 'done'

And finally, here is the whole thing:

use v5.17.1;
use strict;
use warnings;
use warnings FATAL => "utf8";
use open qw(:std :utf8);
use utf8;

use Unicode::UCD qw(charscript);

# regex to match a string that's all of the
# same Script=XXX type
#
my $rx = qr{
    (?=
       [\p{Script=Common}\p{Script=Inherited}] *
        (?<CAPTURE>
            [^\p{Script=Common}\p{Script=Inherited}]
        )
    )
    (??{
        my $capture = $+{CAPTURE};
   #####printf "DEBUG: Got peekahead character %s, U+%04x\n", $capture, ord $capture;
        my $scriptname = charscript(ord $capture);
   #####print "DEBUG: Scriptname is $scriptname\n";
        my $run = q([\p{Script=Common}\p{Script=Inherited}\p{Script=)
                . $scriptname
                . q(}]*);
   #####print "DEBUG: string to re-interpolate as regex is q{$run}\n";
        $run;
    })
}x;


my $data = "foo1 and Πππ 語語語 done";

$| = 1;

while ($data =~ /($rx)/g) {
   print "Got string: '$1'\n";
}

Yeah, there oughta be a better way. I don’t think there is one yet.

So for now, enjoy.
