Convert Unicode symbols to \uXXXX without using json_encode
Problem description
I need a function that properly converts non-ASCII symbols to their \uXXXX representation. I know json_encode does that, but it adds double quotes around the string, and I assume there is a more refined solution that uses less CPU than calling json_encode once per symbol.
Here is my current solution:
$input = preg_replace_callback('#([^\r\n\t\x20-\x7f])#u', function($m) {
    return trim(json_encode($m[1]), '"');
}, $input);
Does anyone have an idea for a simpler and faster solution?
Recommended answer
Since your current solution uses the u regex modifier, I'm assuming your input is encoded as UTF-8.
The following solution is definitely not simpler (apart from the regex), and I'm not even sure it's faster, but it's more low-level and shows the actual escaping procedure.
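$input = preg_replace_callback('#[^\x00-\x7f]#u', function($m) {
    // Convert the matched UTF-8 character to big-endian UTF-16.
    $utf16 = mb_convert_encoding($m[0], 'UTF-16BE', 'UTF-8');
    if (strlen($utf16) <= 2) {
        // One code unit (2 bytes): a single \uXXXX escape.
        $esc = '\u' . bin2hex($utf16);
    }
    else {
        // Two code units (4 bytes): a surrogate pair, one escape per unit.
        $esc = '\u' . bin2hex(substr($utf16, 0, 2)) .
               '\u' . bin2hex(substr($utf16, 2, 2));
    }
    return $esc;
}, $input);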
One fundamental problem is that PHP's built-in ord function works on single bytes, not on UTF-8 characters (mb_ord is only available from PHP 7.2 on). Without it, you either have to use mb_convert_encoding or roll your own UTF-8 decoder (see linked question), which would allow for additional optimizations. Two- and three-byte UTF-8 sequences map to a single UTF-16 code unit. Four-byte sequences require two code units (a high and a low surrogate).
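To illustrate the "roll your own decoder" option, here is a minimal sketch of what such a decoder could look like. The function names utf8_code_point, escape_code_point and escape_non_ascii are hypothetical, not taken from the linked question. It assumes valid UTF-8 input and PHP 7.0 or later (for the scalar type declarations), and it has not been benchmarked against the mb_convert_encoding version.

```php
<?php
// Hypothetical helper: decode ONE valid UTF-8 character into its code point.
// Assumption: $char is a single well-formed UTF-8 sequence (1 to 4 bytes).
function utf8_code_point(string $char): int
{
    $b0 = ord($char[0]);
    if ($b0 < 0x80) {                     // 1 byte:  0xxxxxxx
        return $b0;
    }
    if ($b0 < 0xE0) {                     // 2 bytes: 110xxxxx 10xxxxxx
        return (($b0 & 0x1F) << 6) | (ord($char[1]) & 0x3F);
    }
    if ($b0 < 0xF0) {                     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return (($b0 & 0x0F) << 12)
            | ((ord($char[1]) & 0x3F) << 6)
            | (ord($char[2]) & 0x3F);
    }
    return (($b0 & 0x07) << 18)           // 4 bytes: 11110xxx + 3 continuation bytes
        | ((ord($char[1]) & 0x3F) << 12)
        | ((ord($char[2]) & 0x3F) << 6)
        | (ord($char[3]) & 0x3F);
}

// Hypothetical helper: format a code point as one or two \uXXXX escapes.
function escape_code_point(int $cp): string
{
    if ($cp < 0x10000) {
        // Fits in a single UTF-16 code unit.
        return sprintf('\u%04x', $cp);
    }
    // Outside the BMP: split the 20 remaining bits into a surrogate pair.
    $cp -= 0x10000;
    $high = 0xD800 | ($cp >> 10);         // top 10 bits
    $low  = 0xDC00 | ($cp & 0x3FF);       // bottom 10 bits
    return sprintf('\u%04x\u%04x', $high, $low);
}

// Same regex as in the answer; only the callback body differs.
function escape_non_ascii(string $input)
{
    return preg_replace_callback('#[^\x00-\x7f]#u', function ($m) {
        return escape_code_point(utf8_code_point($m[0]));
    }, $input);
}
```

With this sketch, escape_non_ascii("caf\xC3\xA9") yields caf\u00e9, and the four-byte sequence "\xF0\x9F\x98\x80" (U+1F600) yields the surrogate pair \ud83d\ude00.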
If you're aiming for simplicity and readability, you probably can't beat the json_encode approach.
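For comparison, the json_encode approach can be wrapped in a small function. This is only a sketch: the name escape_with_json_encode is hypothetical, the regex is the one from the answer rather than the one from the question, and substr is used instead of trim to drop the surrounding quotes. It assumes valid UTF-8 input and an available json extension.

```php
<?php
// Hypothetical wrapper around the per-character json_encode idea.
function escape_with_json_encode(string $input)
{
    return preg_replace_callback('#[^\x00-\x7f]#u', function ($m) {
        // json_encode wraps its result in double quotes, e.g. "\u00ef";
        // substr(..., 1, -1) removes the first and last character.
        return substr(json_encode($m[0]), 1, -1);
    }, $input);
}

// Prints: na\u00efve \ud83d\ude00
echo escape_with_json_encode("na\xC3\xAFve \xF0\x9F\x98\x80"), "\n";
```

Only non-ASCII characters ever reach json_encode here, so quotes, backslashes and slashes in the input pass through unescaped.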