Why do Perl string operations on Unicode characters add garbage to the string?

Question

Perl:

$string =~ s/[áàâã]/a/gi; #This line always prepends an "a"
$string =~ s/[éèêë]/e/gi;
$string =~ s/[úùûü]/u/gi;

This regular expression should convert "été" into "ete". Instead, it is converting it to "aetae". In other words, it prepends an "a" to every matched element. Even "à" is converted to "aa".

If I change the first line to this:

$string =~ s/(á|à|â|ã)/a/gi;

it works, but now it prepends an "e" to every matched element (giving "eetee").

Even though I found a suitable solution, why does it behave that way?

I added "use utf8;", but it did not change the behavior (although it broke my output in JavaScript/AJAX).

The stream originates from an Ajax request performed by jQuery. The site it originates from is set to UTF-8.

I am using Perl v5.10 (perl -v returns "This is perl, v5.10.0 built for i586-linux-thread-multi").

Recommended answer

I suspect that what is happening is that the [áàâã] part of your regular expression is not actually matching characters, but matching bytes. The UTF-8 encoding of those characters would look literally like this in the regular expression:

[\xC3\xA1\xC3\xA0\xC3\xA2\xC3\xA3]

And so when the regular expression is fed, for example, 'é' (\xC3\xA9), it looks at it a byte at a time, matches the \xC3, and replaces it with an 'a'. It does this for all of the \xC3 bytes it can find. So, 'été' is turned into 'a\xA9ta\xA9'.

Then the second regular expression, which looks like this:

[\xC3\xA9\xC3\xA8\xC3\xAA\xC3\xAB]

comes along, and it matches the \xA9 portion, and replaces it with an 'e'. So now, 'a\xA9ta\xA9' is turned into 'aetae'.
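
The two passes described above can be reproduced with explicit byte escapes. This is a minimal sketch, assuming that neither the pattern nor the input was decoded, so both are raw UTF-8 bytes; the variable names $pass1 and $pass2 are illustrative and not part of the original code.

```perl
use strict;
use warnings;

# Assumption: the input arrives as undecoded UTF-8 bytes.
my $string = "\xC3\xA9t\xC3\xA9";    # the UTF-8 bytes of "été"

# Byte view of s/[áàâã]/a/gi: the class holds \xC3, \xA1, \xA0, \xA2, \xA3.
(my $pass1 = $string) =~ s/[\xC3\xA1\xC3\xA0\xC3\xA2\xC3\xA3]/a/gi;

# Byte view of s/[éèêë]/e/gi: the class holds \xC3, \xA9, \xA8, \xAA, \xAB.
(my $pass2 = $pass1) =~ s/[\xC3\xA9\xC3\xA8\xC3\xAA\xC3\xAB]/e/gi;

# Show the intermediate bytes in hex, since \xA9 alone is not printable UTF-8.
print join(' ', map { sprintf '%02X', ord } split //, $pass1), "\n";  # 61 A9 74 61 A9
print "$pass2\n";                                                     # aetae
```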

When you replace the [áàâã] with (á|à|â|ã), the first pass matches only complete characters, but your second regular expression still has the original problem, so the \xC3 bytes are replaced with 'e' instead.
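
The same byte-level view accounts for the "eetee" result. Again a sketch under the assumption that pattern and input are both undecoded UTF-8 bytes, with illustrative variable names:

```perl
use strict;
use warnings;

my $string = "\xC3\xA9t\xC3\xA9";    # the UTF-8 bytes of "été"

# Byte view of s/(á|à|â|ã)/a/gi: each alternative is a full two-byte
# sequence, so nothing in "été" matches and the string is unchanged.
(my $pass1 = $string) =~ s/(\xC3\xA1|\xC3\xA0|\xC3\xA2|\xC3\xA3)/a/gi;

# The second pass is still a byte class, so \xC3 and \xA9 each become 'e'.
(my $pass2 = $pass1) =~ s/[\xC3\xA9\xC3\xA8\xC3\xAA\xC3\xAB]/e/gi;

print "$pass2\n";    # eetee
```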

If this is still happening, even with use utf8;, then there may be a bug (or at least a limitation) in the Perl regular expression engine.
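
Another possibility worth checking before suspecting the engine: use utf8; only tells Perl that the literals in the source file are UTF-8. It does not decode data that arrives at run time, so the patterns would then be characters while the input is still bytes. The sketch below assumes the request body reaches the script as raw UTF-8 bytes; $raw is a hypothetical stand-in for that input, and decode comes from the core Encode module.

```perl
use strict;
use warnings;
use utf8;                  # the literals in this file are UTF-8 text
use Encode qw(decode);

# Hypothetical stand-in for the request body: the raw UTF-8 bytes of "été".
my $raw = "\xC3\xA9t\xC3\xA9";

# Decode the bytes into characters before matching against them.
my $string = decode('UTF-8', $raw);

$string =~ s/[áàâã]/a/gi;
$string =~ s/[éèêë]/e/gi;
$string =~ s/[úùûü]/u/gi;

print "$string\n";    # ete
```

If the result can still contain non-ASCII characters, it would need to be encoded again on the way out, which may be what went wrong with the JavaScript/AJAX output mentioned in the question.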
