How to match a string with diacritics in Perl?
Question
For example, match "Nation" in "Îñţérñåţîöñåļîžåţîöñ" without extra modules. Is it possible in newer Perl versions (5.14, 5.15, etc.)?
I found the answer! Thanks to tchrist.
The right solution, using UCA matching (thanks to https://stackoverflow.com/users/471272/tchrist):
# find start/end offsets of each non-overlapping occurrence of the matched substring
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';
my $str = "Îñţérñåţîöñåļîžåţîöñ" x 2;
my $look = "Nation";
# level => 1 compares base letters only, ignoring diacritics and case
my $Collator = Unicode::Collate->new(
    normalization => undef, level => 1
);
# in list context, match() returns the first matching part of $str
my @match = $Collator->match($str, $look);
if (@match) {
    my $found = $match[0];
    my $f_len = length($found);
    say "match result: $found (length is $f_len)";
    my $offset = 0;
    # index() searches for the exact matched text, so only identically accented occurrences are found
    while ((my $start = index($str, $found, $offset)) != -1) {
        my $end = $start + $f_len;
        say sprintf("found at: %s,%s", $start, $end);
        $offset = $end;    # resume right after this match so an adjacent match is not skipped
    }
}
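The built-in index() in the loop above only finds occurrences spelled with exactly the same accents as the first match. As a minimal sketch of one alternative, the collator's own index method can be called repeatedly; in list context it returns the position and length of the next match. The helper name find_offsets and the sample string are assumptions made for illustration, not part of the original answer.

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: return [start, end) offset pairs for every
# non-overlapping match of $look in $str, as judged by $collator.
sub find_offsets {
    my ($collator, $str, $look) = @_;
    my @spans;
    my $pos = 0;
    # In list context index() returns (position, length), or an empty list
    # when nothing matches, which ends the loop.
    while (my ($start, $len) = $collator->index($str, $look, $pos)) {
        push @spans, [$start, $start + $len];
        $pos = $start + ($len || 1);    # always move forward
    }
    return @spans;
}

# normalization => undef is required: the search methods croak otherwise.
my $collator = Unicode::Collate->new(normalization => undef, level => 1);
my $str = "Îñţérñåţîöñåļîžåţîöñ and international";

for my $span (find_offsets($collator, $str, "Nation")) {
    my ($start, $end) = @$span;
    say "found at: $start,$end -> ", substr($str, $start, $end - $start);
}
```

Here the accented and the plain spelling are both reported, because the comparison happens at collation level 1 rather than on raw characters.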
From http://www.perlmonks.org/?node_id=485681
The magic piece of code is:
$str = Unicode::Normalize::NFD($str); $str =~ s/\pM//g;
Code example:
use 5.014;
use utf8;
use Unicode::Normalize;
binmode STDOUT, ':encoding(UTF-8)';
my $str = "Îñţérñåţîöñåļîžåţîöñ";
my $look = "Nation";
say "before: $str\n";
$str = NFD($str);
# M is short alias for \p{Mark} (http://perldoc.perl.org/perluniprops.html)
$str =~ s/\pM//og; # remove "marks"
say "after: $str";
say "is_match: ", $str =~ /$look/i || 0;
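The same NFD-and-strip step can be wrapped in a small function for reuse. This is only a sketch: strip_marks is a hypothetical name, and the second sample string is an assumption added to show where the approach stops working. Letters such as "ø" have no canonical decomposition, so NFD leaves them untouched and no mark is removed.

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: decompose the text canonically, then delete every
# combining mark (\pM is short for \p{Mark}).
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\pM//g;
    return $decomposed;
}

# Accented letters that decompose into base letter + mark are reduced.
say strip_marks("Îñţérñåţîöñåļîžåţîöñ");

# "ø" is a letter in its own right, not "o" plus a mark, so it survives
# and a later match against /soren/i would fail.
say strip_marks("Søren");
```

This limitation is one reason the collation-based match is usually the safer choice.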
Recommended answer
The right solution, using UCA (thanks to tchrist):
# check whether $str contains $look, ignoring diacritics and case (UCA level 1)
use 5.014;
use utf8;
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';
my $str = "Îñţérñåţîöñåļîžåţîöñ" x 2;
my $look = "Nation";
my $Collator = Unicode::Collate->new(
    normalization => undef, level => 1
);
my @match = $Collator->match($str, $look);
say "match ok!" if @match;
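To make the effect of level => 1 more concrete, the following sketch compares a few strings with the collator's eq method at two levels. It assumes the default collation table that ships with Unicode::Collate; the variable names are illustrative only. At level 1 only base letters count, while level 2 also takes accents into account (case differences first matter at level 3).

```perl
use 5.014;
use strict;
use warnings;
use utf8;
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';

# Two collators that differ only in how many levels they compare.
my $primary   = Unicode::Collate->new(normalization => undef, level => 1);
my $secondary = Unicode::Collate->new(normalization => undef, level => 2);

# Small formatter for the demo output.
sub verdict { $_[0] ? "equal" : "different" }

# Level 1: accents and case are both ignored.
say "level 1, Nation vs ñåţîöñ: ", verdict($primary->eq("Nation", "ñåţîöñ"));

# Level 2: case is still ignored, accents are not.
say "level 2, Nation vs nation: ", verdict($secondary->eq("Nation", "nation"));
say "level 2, nation vs ñåţîöñ: ", verdict($secondary->eq("nation", "ñåţîöñ"));
```

Choosing level 1 is therefore what lets "Nation" match inside the accented word in the answer above.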
P.S. "Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment." © tchrist, from "Why does modern Perl avoid UTF-8 by default?"