Strip out multi-byte white space from a string in PHP


Question

I am trying to use preg_replace to eliminate the Japanese full-width white space "　" (U+3000, IDEOGRAPHIC SPACE) from a string input, but I end up with a corrupted multi-byte string.

I would prefer to use preg_replace rather than str_replace. Here is some sample code:


$keywords = ' ラメ単色';
$keywords = str_replace(array(' ', ' '), ' ', urldecode($keywords)); // outputs :'ラメ単色'

$keywords = preg_replace("@[  ]@", ' ',urldecode($keywords)); // outputs :'�� ��単色'

Does anyone know why this happens and how to fix it?

Recommended answer

Add the u flag to your regex. This makes the regex engine treat both the pattern and the input string as UTF-8.

$keywords = preg_replace("@[  ]@u", ' ',urldecode($keywords));
// outputs :'ラメ単色'
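
The same fix can be sketched with escapes instead of literal spaces, which keeps the invisible characters out of the source file. This is only an illustration: the variable name $clean is an assumption, and \x{3000} is the code point of IDEOGRAPHIC SPACE. PCRE accepts \x{...} values above 0xFF only when the u flag is set.

```php
<?php
// U+3000 IDEOGRAPHIC SPACE followed by four Japanese characters.
// The space is written as byte escapes so it is visible in the source.
$keywords = "\xE3\x80\x80ラメ単色";

// With the u flag the character class matches whole code points:
// either an ASCII space or U+3000.
$clean = preg_replace('/[ \x{3000}]/u', ' ', $keywords);

// The full-width space is now a single ASCII space; the rest is intact.
var_dump($clean === ' ラメ単色'); // bool(true)
```

If the goal is to drop the space rather than normalize it, passing '' as the replacement (or calling trim() on the result) would do that.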


The reason it mangles the string is that, to the regex engine, the characters in your character class, 20 (space) and e3 80 80 (IDEOGRAPHIC SPACE), are not treated as two characters but as the separate bytes 20, e3 and 80.

When you look at the byte sequence of the string being scanned, you get e3 80 80 e3 83 a9 e3 83 a1 e5 8d 98 e8 89 b2. The first character is an IDEOGRAPHIC SPACE, but because PHP treats the string as a sequence of bytes, it replaces each of the first four bytes individually, since each one matches a single byte that the regex engine is looking for.
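
The byte-level behaviour can be reproduced with a short sketch. The hex dumps use bin2hex, and the pattern spells out the three bytes that the engine effectively sees in the character class when the u flag is missing. The variable names are illustrative.

```php
<?php
// The subject string from the question: U+3000 followed by ラメ単色.
$keywords = "\xE3\x80\x80ラメ単色";

// Hex dump of the UTF-8 bytes, separated by spaces.
echo implode(' ', str_split(bin2hex($keywords), 2)), "\n";
// e3 80 80 e3 83 a9 e3 83 a1 e5 8d 98 e8 89 b2

// Without the u flag, the class holds three independent bytes:
// 20, e3 and 80. Every occurrence of any of them is replaced.
$mangled = preg_replace("/[ \xE3\x80]/", ' ', $keywords);
echo implode(' ', str_split(bin2hex($mangled), 2)), "\n";
// 20 20 20 20 83 a9 20 83 a1 e5 8d 98 e8 89 b2
```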

As for the mangling that produces the � (REPLACEMENT CHARACTER), this happens because the byte e3 also appears further along in the string. e3 is the lead byte of three-byte Japanese characters such as e3 83 a9 (KATAKANA LETTER RA). When that leading e3 is replaced with 20 (space), the bytes that remain no longer form a valid UTF-8 sequence.
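
One way to confirm that the result is no longer valid UTF-8 is sketched below. The helper is_valid_utf8 is hypothetical (it is not part of PHP); it relies on preg_match returning false when the u flag is set and the subject is not valid UTF-8.

```php
<?php
// Bytes left after the byte-wise replacement: the lead byte e3 of each
// katakana letter has become 20, leaving stray continuation bytes.
$mangled = "\x20\x20\x20\x20\x83\xA9\x20\x83\xA1単色";

// Hypothetical helper: an empty pattern with the u flag matches any
// valid UTF-8 subject, and preg_match returns false for invalid UTF-8.
function is_valid_utf8(string $s): bool
{
    return preg_match('//u', $s) === 1;
}

var_dump(is_valid_utf8("\xE3\x83\xA9")); // bool(true): KATAKANA LETTER RA
var_dump(is_valid_utf8("\x20\x83\xA9")); // bool(false): lead byte replaced by a space
var_dump(is_valid_utf8($mangled));       // bool(false)
```

Where the mbstring extension is available, mb_check_encoding($s, 'UTF-8') is an alternative to this helper.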

When you enable the u flag, the regex engine treats the string as UTF-8 and matches the characters in your character class as whole characters rather than byte by byte.
