Convert CESU-8 to UTF-8 with high performance

Views: 135 Published: 2021/6/15 19:36:08 Tags: php performance unicode utf-8 cesu-8


Question

I have some raw text that is usually a valid UTF-8 string. Occasionally, however, the input turns out to be a CESU-8 string instead. It is technically possible to detect this and convert to UTF-8, but because it happens so rarely, I would rather not spend a lot of CPU time on it.

Is there a fast way to detect whether a string is encoded as CESU-8 or UTF-8? I guess I could always blindly convert "UTF-8" to UTF-16LE and then back to UTF-8 using iconv(), and I would probably get the correct result every time, because CESU-8 is close enough to UTF-8 for this to work. Can you suggest anything faster? (I expect the input to be CESU-8 rather than valid UTF-8 in roughly 0.01-0.1% of all strings.)

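(CESU-8 is a non-standard string format that contains 16-bit surrogate pairs encoded in UTF-8. Technically, a UTF-8 string should contain the characters represented by those surrogate pairs, not the surrogate pairs themselves.)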
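To make the difference concrete, the following is a minimal sketch of how one supplementary character becomes six bytes in CESU-8 but four bytes in UTF-8. The helper names `utf8_3byte` and `cesu8_encode` are made up for this illustration and are not part of PHP.

```php
<?php
// Hypothetical helper: encode a 16-bit value (here a surrogate) as a
// three-byte UTF-8-style sequence.
function utf8_3byte(int $u): string
{
    return pack(
        "C*",
        0xE0 | ($u >> 12),
        0x80 | (($u >> 6) & 0x3F),
        0x80 | ($u & 0x3F)
    );
}

// Hypothetical helper: encode a supplementary code point
// (U+10000..U+10FFFF) the way CESU-8 does, as two encoded surrogates.
function cesu8_encode(int $cp): string
{
    $v    = $cp - 0x10000;           // 20-bit value
    $high = 0xD800 | ($v >> 10);     // high surrogate (top 10 bits)
    $low  = 0xDC00 | ($v & 0x3FF);   // low surrogate (bottom 10 bits)
    return utf8_3byte($high) . utf8_3byte($low);
}

echo bin2hex(cesu8_encode(0x1F600)), "\n"; // eda0bdedb880 (CESU-8, 6 bytes)
echo bin2hex("\u{1F600}"), "\n";           // f09f9880     (UTF-8, 4 bytes)
```

Characters in the Basic Multilingual Plane (up to U+FFFF) are encoded identically in both formats, which is why most input looks like ordinary UTF-8.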

Recommended answer

Here's a more efficient version of your conversion function:

// Match a high surrogate (ED A0-AF xx) immediately followed by a
// low surrogate (ED B0-BF xx).
$regex = '@(\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF])@';
$s = preg_replace_callback($regex, function($m) {
    $in = unpack("C*", $m[0]); // Byte values; the array is indexed from 1.
    $in[2] += 1; // Effectively adds 0x10000 to the codepoint.
    return pack("C*",
        0xF0 | (($in[2] & 0x1C) >> 2),
        0x80 | (($in[2] & 0x03) << 4) | (($in[3] & 0x3C) >> 2),
        0x80 | (($in[3] & 0x03) << 4) | ($in[5] & 0x0F),
        $in[6] // The last byte is copied unchanged.
    );
}, $s);
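Since CESU-8 input is expected to be rare, one option is to guard the replacement with a cheap check. The sketch below wraps the snippet above in a function; the names `contains_encoded_surrogate` and `cesu8_to_utf8` are assumptions for this example. It relies on the fact that well-formed UTF-8 never has the byte 0xED followed by a byte in the range A0-BF, so that pair signals an encoded surrogate. Whether the guard actually saves time over calling `preg_replace_callback` directly depends on the input and should be measured.

```php
<?php
// Returns true if $s contains a UTF-8-style encoded surrogate
// (0xED followed by 0xA0..0xBF), which well-formed UTF-8 never does.
function contains_encoded_surrogate(string $s): bool
{
    // Cheap exit: without any 0xED byte there can be no surrogate.
    if (strpos($s, "\xED") === false) {
        return false;
    }
    return preg_match('@\xED[\xA0-\xBF]@', $s) === 1;
}

// Converts CESU-8 surrogate pairs to four-byte UTF-8 sequences and
// leaves everything else (including unpaired surrogates) untouched.
function cesu8_to_utf8(string $s): string
{
    if (!contains_encoded_surrogate($s)) {
        return $s; // Common case: nothing to convert.
    }
    $regex = '@\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]@';
    $out = preg_replace_callback($regex, function ($m) {
        $in = unpack("C*", $m[0]); // 1-based byte values
        $in[2] += 1;               // adds 0x10000 to the code point
        return pack("C*",
            0xF0 | (($in[2] & 0x1C) >> 2),
            0x80 | (($in[2] & 0x03) << 4) | (($in[3] & 0x3C) >> 2),
            0x80 | (($in[3] & 0x03) << 4) | ($in[5] & 0x0F),
            $in[6]
        );
    }, $s);
    return $out ?? $s; // preg_replace_callback returns null on error
}

$cesu = "A\xED\xA0\xBD\xED\xB8\x80B";      // "A", U+1F600 as CESU-8, "B"
echo bin2hex(cesu8_to_utf8($cesu)), "\n";  // 41f09f988042
```

Strings without any encoded surrogate are returned as they are, so the function can be applied to every input.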

The code only converts a high surrogate that is immediately followed by a low surrogate, and it converts the two three-byte CESU-8 sequences directly into one four-byte UTF-8 sequence, i.e. from

ED       A0-AF    80-BF    ED       B0-BF    80-BF
11101101 1010aaaa 10bbbbbb 11101101 1011cccc 10dddddd

to

F0-F4    80-BF    80-BF    80-BF
11110oaa 10aabbbb 10bbcccc 10dddddd    // o is "overflow" bit
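As a cross-check of the bit layout, the same mapping can be computed the long way, through the code point. This is only a sketch for verification, not a faster alternative; the function name `surrogate_pair_to_utf8` is hypothetical.

```php
<?php
// Hypothetical helper: convert one six-byte CESU-8 surrogate pair to
// four-byte UTF-8 by going through the code point explicitly.
function surrogate_pair_to_utf8(string $six): string
{
    $b = array_values(unpack("C*", $six)); // 0-based byte values

    // Rebuild the two 16-bit surrogates from their three-byte encodings.
    $high = (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
    $low  = (($b[3] & 0x0F) << 12) | (($b[4] & 0x3F) << 6) | ($b[5] & 0x3F);

    // Standard UTF-16 surrogate arithmetic.
    $cp = 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);

    // Standard four-byte UTF-8 encoding of the code point.
    return pack("C*",
        0xF0 | ($cp >> 18),
        0x80 | (($cp >> 12) & 0x3F),
        0x80 | (($cp >> 6) & 0x3F),
        0x80 | ($cp & 0x3F)
    );
}

// U+1F600: surrogates D83D DE00, CESU-8 bytes ED A0 BD ED B8 80.
echo bin2hex(surrogate_pair_to_utf8("\xED\xA0\xBD\xED\xB8\x80")), "\n"; // f09f9880
```

For this input, aaaa = 0000 becomes oaaaa = 00001 after the increment, which yields the bytes F0 9F 98 80, the same result as the bit-shuffling version.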

