Convert Unicode symbols to \uXXXX without using json_encode

Views: 276 Published: 2020/10/1 1:14:49 Tags: php unicode character-encoding

Problem description

I need a function that properly converts non-ASCII symbols to their \uXXXX representation. I know json_encode does that, but it adds double quotes to the string, and I assume there might be a more refined solution that consumes less CPU than calling json_encode on each symbol.

This is the current solution:

    $input=preg_replace_callback('#([^\r\n\t\x20-\x7f])#u', function($m) {
        return trim(json_encode($m[1]),'"');
    }, $input);

Does anyone have an idea for a simpler and faster solution?

Recommended answer

Since your current solution uses the u regex modifier, I'm assuming your input is encoded as UTF-8.

The following solution is definitely not simpler (apart from the regex), and I'm not even sure it's faster, but it's more low-level and shows the actual escaping procedure.

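    $input = preg_replace_callback('#[^\x00-\x7f]#u', function($m) {
        // Re-encode the matched UTF-8 character as big-endian UTF-16.
        $utf16 = mb_convert_encoding($m[0], 'UTF-16BE', 'UTF-8');
        if (strlen($utf16) <= 2) {
            // One code unit: a character in the Basic Multilingual Plane.
            $esc = '\u' . bin2hex($utf16);
        }
        else {
            // Two code units: a surrogate pair, each escaped separately.
            $esc = '\u' . bin2hex(substr($utf16, 0, 2)) .
                   '\u' . bin2hex(substr($utf16, 2, 2));
        }
        return $esc;
    }, $input);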

One fundamental problem is that PHP doesn't have an ord function that works with UTF-8. You either have to use mb_convert_encoding, or you have to roll your own UTF-8 decoder (see linked question), which would allow for additional optimizations. Two- and three-byte UTF-8 sequences map to a single UTF-16 code unit. Four-byte sequences require two code units (a high and a low surrogate).
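To make the byte-to-code-unit mapping concrete, here is a minimal sketch of what a hand-rolled decoder might look like. It is not the code from the linked question; the function names utf8_code_point, escape_code_point and escape_non_ascii are hypothetical, and the sketch assumes PHP 7 or later and input that is already valid UTF-8 (the u modifier makes preg_replace_callback return null otherwise). On PHP 7.2+ with mbstring, mb_ord could likely replace the decoder.

```php
<?php
// Hypothetical helper: decode ONE well-formed UTF-8 character to its code point.
// No validation is done; the regex with the u modifier is assumed to have
// checked the input already.
function utf8_code_point(string $ch): int
{
    $b0 = ord($ch[0]);
    switch (strlen($ch)) {
        case 1:  // 0xxxxxxx
            return $b0;
        case 2:  // 110xxxxx 10xxxxxx
            return (($b0 & 0x1F) << 6) | (ord($ch[1]) & 0x3F);
        case 3:  // 1110xxxx 10xxxxxx 10xxxxxx
            return (($b0 & 0x0F) << 12)
                 | ((ord($ch[1]) & 0x3F) << 6)
                 | (ord($ch[2]) & 0x3F);
        default: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return (($b0 & 0x07) << 18)
                 | ((ord($ch[1]) & 0x3F) << 12)
                 | ((ord($ch[2]) & 0x3F) << 6)
                 | (ord($ch[3]) & 0x3F);
    }
}

// Hypothetical helper: format a code point as one or two \uXXXX escapes.
function escape_code_point(int $cp): string
{
    if ($cp < 0x10000) {
        // Two- and three-byte sequences fit in a single UTF-16 code unit.
        return sprintf('\u%04x', $cp);
    }
    // Four-byte sequences: split the 20 remaining bits into a surrogate pair.
    $cp -= 0x10000;
    $high = 0xD800 | ($cp >> 10);    // top 10 bits
    $low  = 0xDC00 | ($cp & 0x3FF);  // bottom 10 bits
    return sprintf('\u%04x\u%04x', $high, $low);
}

// Hypothetical wrapper using the same regex as the answer.
function escape_non_ascii(string $input)
{
    return preg_replace_callback('#[^\x00-\x7f]#u', function ($m) {
        return escape_code_point(utf8_code_point($m[0]));
    }, $input);
}

// U+00E9, U+20AC and U+1F600 cover the two-, three- and four-byte cases.
echo escape_non_ascii("\u{e9} \u{20ac} \u{1F600}"), "\n";
// prints: \u00e9 \u20ac \ud83d\ude00
```

Whether this is actually faster than mb_convert_encoding or json_encode is not something the sketch establishes; that would need measuring on real input.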

If you're aiming for simplicity and readability, you probably can't beat the json_encode approach.
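As a quick sanity check, the two approaches can be compared side by side. The sketch below wraps each one in a function (the names escape_with_json and escape_with_mb are hypothetical) and uses the answer's regex for both, so that they escape exactly the same characters. It assumes PHP 7 or later with the mbstring extension loaded.

```php
<?php
// Per-character json_encode approach from the question.
function escape_with_json(string $input)
{
    return preg_replace_callback('#[^\x00-\x7f]#u', function ($m) {
        // json_encode wraps its result in double quotes; strip them.
        return trim(json_encode($m[0]), '"');
    }, $input);
}

// mb_convert_encoding approach from the answer (requires mbstring).
function escape_with_mb(string $input)
{
    return preg_replace_callback('#[^\x00-\x7f]#u', function ($m) {
        $utf16 = mb_convert_encoding($m[0], 'UTF-16BE', 'UTF-8');
        if (strlen($utf16) <= 2) {
            return '\u' . bin2hex($utf16);
        }
        return '\u' . bin2hex(substr($utf16, 0, 2)) .
               '\u' . bin2hex(substr($utf16, 2, 2));
    }, $input);
}

// The quotes json_encode adds are part of the returned string.
echo json_encode("\u{e9}"), "\n";
// prints: "\u00e9"

$sample = "caf\u{e9} \u{20ac}5 \u{1F600}";
echo escape_with_json($sample), "\n";
// prints: caf\u00e9 \u20ac5 \ud83d\ude00

// Both functions should produce the same escaped string.
var_dump(escape_with_json($sample) === escape_with_mb($sample));
```

Note that the question's original regex also escapes ASCII control characters other than \r, \n and \t, which the answer's regex leaves alone, so the two original snippets are not strictly interchangeable.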
