Locale-aware Perl regular expressions (matching word boundaries)
Question
I'm currently stuck getting a regular expression in Perl (taken from an earlier question of mine) to match word characters from a non-ASCII locale (e.g., German umlauts).
I have already tried various things, such as setting the correct locale (using setlocale) and converting the data that I receive from MySQL to UTF-8 (using decode_utf8). Unfortunately, to no avail. Google did not help much either.
Is there any chance of making the following regex locale-aware, so that
$street = "Täststraße"; # I know that this is not orthographically correct
$street =~ s{
\b (\w{0,3}) (\w*) \b
}
{
$1 . ( '*' x length $2 )
}gex;
ends up returning $street = "Täs*******"
instead of "Tästs***ße"?
Recommended answer
I would expect the regex to produce "Täs*******", and that is what I get when I add "use utf8" to a UTF-8 encoded file containing your code above.
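As a minimal sketch of that setup, the script below assumes it is saved as a UTF-8 encoded file. The helper name mask_street is hypothetical and only wraps the substitution from the question so it can be called on any string; the binmode call is an assumption about how the output is meant to be displayed.

```perl
use strict;
use warnings;
use utf8;    # tells Perl that this source file is UTF-8 encoded

# Hypothetical helper (not from the original post): keeps up to three
# leading word characters of each word and replaces the rest with '*'.
sub mask_street {
    my ($street) = @_;
    $street =~ s{
        \b (\w{0,3}) (\w*) \b
    }
    {
        $1 . ( '*' x length $2 )
    }gex;
    return $street;
}

# Encode output as UTF-8 so the umlaut is displayed correctly.
binmode STDOUT, ':encoding(UTF-8)';

# Under "use utf8" this literal is a character string, so \w and \b
# treat "ä" and "ß" as word characters.
print mask_street("Täststraße"), "\n";
```

Run from a UTF-8 terminal, this should print the masked street name that the question asks for.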
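(If everything is Latin-1, that changes the behavior of the regex engine: a string that Perl stores internally as Latin-1 is matched with native rules by default, under which "ä" and "ß" do not count as word characters. Hence the existence of utf8::upgrade, which switches the string to Perl's internal UTF-8 representation so that Unicode rules apply. See Unicode::Semantics.)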
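The effect of utf8::upgrade can be sketched as follows. This is an illustration under stated assumptions, not the answerer's own code: it assumes default regex semantics (no "use locale", no unicode_strings feature, no /u modifier), and mask_street is a hypothetical helper that wraps the substitution from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post) wrapping the
# substitution from the question.
sub mask_street {
    my ($street) = @_;
    $street =~ s{
        \b (\w{0,3}) (\w*) \b
    }
    {
        $1 . ( '*' x length $2 )
    }gex;
    return $street;
}

# "Täststraße" written with escapes: every character fits into one byte,
# so Perl keeps the string in its native 8-bit (Latin-1) representation.
my $street = "T\x{e4}ststra\x{df}e";

# Native semantics: \xE4 and \xDF are not word characters, so the
# string is split into several short "words".
my $before = mask_street($street);

# Same characters, but now stored internally as UTF-8, which makes the
# regex engine apply Unicode rules to this string.
utf8::upgrade($street);
my $after = mask_street($street);

binmode STDOUT, ':encoding(UTF-8)';
print "before upgrade: $before\n";
print "after upgrade:  $after\n";
```

utf8::upgrade does not change which characters the string contains, only how Perl stores them, which is why the same substitution gives different results before and after the call.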
I see you fixed your post and that we now agree on the expected result. Basically, use Unicode::Semantics when you want Unicode semantics in your regexps.