Convert Unicode symbols to \uXXXX without using json_encode

Question

I need a function that properly converts non-ASCII symbols to their \uXXXX representation. I know json_encode does that, but it adds double quotes to the string, and I assume there is a more refined solution that uses less CPU than calling json_encode once per symbol.

Here is my current solution:

    $input=preg_replace_callback('#([^\r\n\t\x20-\x7f])#u', function($m) {
        return trim(json_encode($m[1]),'"');
    }, $input);

Does anyone have an idea for a simpler and faster solution?

Recommended answer

Since your current solution uses the u regex modifier, I'm assuming your input is encoded as UTF-8.

The following solution is definitely not simpler (apart from the regex), and I'm not even sure it's faster, but it's more low-level and shows the actual escaping procedure.

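    // Match every non-ASCII character; the u modifier makes the pattern UTF-8 aware.
    $input = preg_replace_callback('#[^\x00-\x7f]#u', function($m) {
        // Re-encode the single matched character as big-endian UTF-16.
        $utf16 = mb_convert_encoding($m[0], 'UTF-16BE', 'UTF-8');
        if (strlen($utf16) <= 2) {
            // One code unit (2 bytes): a single \uXXXX escape.
            $esc = '\u' . bin2hex($utf16);
        }
        else {
            // Two code units (4 bytes): high and low surrogate, escaped separately.
            $esc = '\u' . bin2hex(substr($utf16, 0, 2)) .
                   '\u' . bin2hex(substr($utf16, 2, 2));
        }
        return $esc;
    }, $input);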

One fundamental problem is that PHP doesn't have an ord function that works with UTF-8. You either have to use mb_convert_encoding or roll your own UTF-8 decoder (see the linked question), which would allow for additional optimizations. Two- and three-byte UTF-8 sequences map to a single UTF-16 code unit; four-byte sequences require two code units (a high and a low surrogate).
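As a rough sketch of what the surrogate arithmetic looks like without going through a UTF-16 conversion: on PHP 7.2 and later, the mbstring extension provides mb_ord, which returns the code point of a UTF-8 character. The function name escape_non_ascii below is hypothetical and not part of the original answer, and the sketch assumes valid UTF-8 input with mbstring loaded. No claim is made that it is faster; that would need measuring.

```php
<?php
// Hypothetical helper (not from the original answer): escapes every
// non-ASCII character as \uXXXX, using UTF-16 surrogate pairs above U+FFFF.
// Assumes PHP >= 7.2 with mbstring, and valid UTF-8 input.
function escape_non_ascii(string $input): ?string
{
    // preg_replace_callback returns null if $input is not valid UTF-8.
    return preg_replace_callback('#[^\x00-\x7f]#u', function ($m) {
        $cp = mb_ord($m[0], 'UTF-8'); // code point of the matched character

        if ($cp < 0x10000) {
            // Basic Multilingual Plane: one UTF-16 code unit.
            return sprintf('\u%04x', $cp);
        }

        // Supplementary plane: split the 20-bit offset into two surrogates.
        $cp -= 0x10000;
        $high = 0xD800 | ($cp >> 10);   // top 10 bits
        $low  = 0xDC00 | ($cp & 0x3FF); // bottom 10 bits
        return sprintf('\u%04x\u%04x', $high, $low);
    }, $input);
}

// U+00FC, U+00DF and U+1F600 written with PHP 7 code point escapes.
echo escape_non_ascii("Gr\u{fc}\u{df}e \u{1F600}"), "\n";
// prints: Gr\u00fc\u00dfe \ud83d\ude00
```

ASCII characters, including quotes and backslashes, pass through untouched, which matches the answer's regex rather than json_encode's full escaping.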

If you're aiming for simplicity and readability, you probably can't beat the json_encode approach.
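If the per-symbol call is the main concern, one possible variation (an assumption on my part, not something the original answer proposes) is to call json_encode once on the whole string and strip the surrounding quotes. Note that this is not equivalent to the regex version: json_encode also escapes double quotes, backslashes, and control characters such as newlines and tabs. The function name json_escape_whole is hypothetical.

```php
<?php
// Hypothetical helper: one json_encode call for the whole string.
// Caveat: unlike the regex-based versions, this also escapes ", \ and
// control characters (a newline becomes the two characters \n).
function json_escape_whole(string $input): ?string
{
    // JSON_UNESCAPED_SLASHES keeps "/" from being written as "\/".
    $json = json_encode($input, JSON_UNESCAPED_SLASHES);
    if ($json === false) {
        return null; // e.g. the input was not valid UTF-8
    }
    // Drop the leading and trailing double quote added by json_encode.
    return substr($json, 1, -1);
}

echo json_escape_whole("caf\u{e9}"), "\n";
// prints: caf\u00e9
```

Whether the extra escaping is acceptable depends on where the result is used, so this only fits if the output is going into a JSON-style string literal anyway.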
