Locale-aware Perl regular expressions (matching word boundaries)

Question

I'm currently somewhat stuck getting a regular expression in Perl (taken from an earlier question of mine) to match word characters from a non-ASCII locale (e.g., German umlauts).

I already tried various things such as setting the correct locale (using setlocale), converting data that I receive from MySQL to UTF8 (using decode_utf8), and so on... Unfortunately, to no avail. Google also did not help much.

Is there any chance to get the following regex locale-aware so that

$street = "Täststraße"; # I know that this is not orthographically correct
$street =~ s{
               \b (\w{0,3}) (\w*) \b
            }
            {
               $1 . ( '*' x length $2 )
            }gex;

finally returns $street = "Täs*******" instead of "Tästs***ße"?

Recommended answer

I would expect that the regex result in "Täs*******". And this is what I get when I "use utf8" in a utf-8 encoded file with your code above.
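As a minimal sketch of that setup, the script below assumes it is saved as a UTF-8 encoded file. The helper name `mask_street` is hypothetical and only wraps the substitution from the question; the `binmode` call is an added assumption so that the result prints cleanly on a UTF-8 terminal.

```perl
use strict;
use warnings;
use utf8;                            # string literals in this file are decoded as UTF-8 characters
binmode STDOUT, ':encoding(UTF-8)';  # encode characters back to UTF-8 on output

# Hypothetical helper wrapping the substitution from the question:
# keep up to three leading word characters, mask the rest of each word.
sub mask_street {
    my ($street) = @_;
    $street =~ s{
        \b (\w{0,3}) (\w*) \b
    }{
        $1 . ( '*' x length $2 )
    }gex;
    return $street;
}

print mask_street("Täststraße"), "\n";
```

With `use utf8` in effect the literal is a character string, so `\w` treats `ä` and `ß` as word characters and the whole street name is handled as one word.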

(If everything is latin-1, that changes the behavior of the regex engine. Hence the existence of utf8::upgrade. See Unicode::Semantics.)
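The effect of `utf8::upgrade` can be sketched as follows. This is an illustration under stated assumptions, not the answerer's code: it assumes a perl where neither `use locale` nor `use feature 'unicode_strings'` is in effect, it writes the street name with Latin-1 escapes so the file stays pure ASCII, and `mask_street` is again a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the question.
sub mask_street {
    my ($street) = @_;
    $street =~ s{
        \b (\w{0,3}) (\w*) \b
    }{
        $1 . ( '*' x length $2 )
    }gex;
    return $street;
}

# "Täststraße" as Latin-1 code points; stored natively, not as internal UTF-8.
my $native = "T\xE4ststra\xDFe";

# Same characters, but upgraded to Perl's internal UTF-8 representation,
# which makes the regex engine apply Unicode semantics to \w and \b.
my $upgraded = $native;
utf8::upgrade($upgraded);

my $masked_native   = mask_street($native);
my $masked_upgraded = mask_street($upgraded);

binmode STDOUT, ':encoding(UTF-8)';
print "native:   $masked_native\n";
print "upgraded: $masked_upgraded\n";
```

Under those assumptions the native string should reproduce the `Tästs***ße` result from the question, while the upgraded copy yields `Täs*******`, even though both variables hold the same characters.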

I see you fixed your post and that we agree on the expected result. Basically, use Unicode::Semantics when you want Unicode semantics on your regexps.