Why do Perl string operations on Unicode characters add garbage to the string?

Question

Perl:

$string =~ s/[áàâã]/a/gi; #This line always prepends an "a"
$string =~ s/[éèêë]/e/gi;
$string =~ s/[úùûü]/u/gi;

These substitutions should convert "été" into "ete". Instead, they convert it to "aetae". In other words, an "a" is prepended to every matched element. Even "à" is converted to "aa".

If I change the first line to this:

$string =~ s/(á|à|â|ã)/a/gi;

it works, but now an "e" is prepended to every matched element instead (for example, "eetee").

Even though I found a suitable solution, why does it behave that way?

I added "use utf8;", but it did not change the behavior (although it broke my output in JavaScript/AJAX).

The stream originates from an Ajax request performed by jQuery. The site it originates from is set to UTF-8.

I am using Perl v5.10 (perl -v returns "This is perl, v5.10.0 built for i586-linux-thread-multi").

Recommended answer

I suspect that the [áàâã] part of your regular expression is not actually matching characters, but bytes. The UTF-8 encoding of those characters would look literally like this in the regular expression:

[\xC3\xA1\xC3\xA0\xC3\xA2\xC3\xA3]

So when the regular expression is fed, for example, 'é' (\xC3\xA9), it looks at it one byte at a time, matches the \xC3, and replaces it with an 'a'. It does this for every \xC3 byte it can find. So 'été' is turned into 'a\xA9ta\xA9'.

Then the second regular expression, which looks like this:

[\xc3\xA9\xC3\xA8\xC3\xAA\xC3\xAB]

comes along, matches the \xA9 portion, and replaces it with an 'e'. So now 'a\xA9ta\xA9' is turned into 'aetae'.
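
The two byte-wise passes described above can be reproduced with a minimal sketch. It assumes the input really is the raw UTF-8 bytes of "été" and that the patterns are being read as bytes; the variable names $step1 and $step2 are illustrative only, and the bytes are written as escapes so the result does not depend on how this file is saved.

```perl
use strict;
use warnings;

# Assumption: the input is the raw UTF-8 byte sequence for "été".
my $string = "\xC3\xA9t\xC3\xA9";

# Byte-level equivalent of s/[áàâã]/a/gi when the pattern is read as bytes.
(my $step1 = $string) =~ s/[\xC3\xA1\xC3\xA0\xC3\xA2\xC3\xA3]/a/gi;

# Byte-level equivalent of s/[éèêë]/e/gi under the same conditions.
(my $step2 = $step1) =~ s/[\xC3\xA9\xC3\xA8\xC3\xAA\xC3\xAB]/e/gi;

# Show the intermediate bytes in hex, then the final string.
print join(' ', map { sprintf '%02X', ord } split //, $step1), "\n";   # 61 A9 74 61 A9
print "$step2\n";                                                       # aetae
```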

When you replace [áàâã] with (á|à|â|ã), the first pass matches only complete characters, because each alternative is a full byte sequence. Your second regular expression still has the original problem, though, so the \xC3 bytes are replaced with 'e' instead.
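
A similar sketch, under the same assumptions (raw UTF-8 bytes as input, patterns read as bytes), shows why the alternation variant produces "eetee": the first substitution no longer touches "été", and the unchanged character class in the second one replaces both bytes of each 'é'.

```perl
use strict;
use warnings;

# Assumption: the input is the raw UTF-8 byte sequence for "été".
my $string = "\xC3\xA9t\xC3\xA9";

# Byte-level equivalent of s/(á|à|â|ã)/a/gi: every alternative is a complete
# two-byte sequence, so a lone \xC3 followed by \xA9 does not match.
$string =~ s/(\xC3\xA1|\xC3\xA0|\xC3\xA2|\xC3\xA3)/a/gi;

# Byte-level equivalent of the unchanged s/[éèêë]/e/gi: the class contains
# both \xC3 and \xA9, so each byte of 'é' becomes its own 'e'.
$string =~ s/[\xC3\xA9\xC3\xA8\xC3\xAA\xC3\xAB]/e/gi;

print "$string\n";   # eetee
```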

If this is still happening, even with use utf8;, then there may be a bug (or at least a limitation) in the Perl regular expression engine.
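
One thing worth checking before blaming the engine: use utf8; only changes how literals in the source file are read, not how incoming data is treated. The following is a hedged sketch, not the asker's actual code, and it assumes the request body arrives as raw UTF-8 bytes. It decodes the input with the core Encode module, writes the character classes with \x{...} escapes so they do not depend on the source file encoding, and encodes the result again for output. The variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the raw UTF-8 bytes of "été" as they might arrive
# from the Ajax request.
my $bytes = "\xC3\xA9t\xC3\xA9";

# Turn the bytes into a character string (three characters).
my $string = decode('UTF-8', $bytes);

# Character classes given as code points, so each entry is one character.
$string =~ s/[\x{E1}\x{E0}\x{E2}\x{E3}]/a/gi;   # á à â ã
$string =~ s/[\x{E9}\x{E8}\x{EA}\x{EB}]/e/gi;   # é è ê ë
$string =~ s/[\x{FA}\x{F9}\x{FB}\x{FC}]/u/gi;   # ú ù û ü

# Encode back to bytes before sending the response.
my $out = encode('UTF-8', $string);
print "$out\n";   # ete
```

With both the data decoded and the patterns expressed as characters, the substitutions operate on whole characters rather than on individual bytes.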
