How to Convert UTF-16 Surrogate Decimal to UNICODE in C++

Views: 55 Published: 2021/9/15 19:39:06 Tags: c++ unicode utf-16 surrogate-pairs

Question

I got some string data from a parameter, such as &#55357;&#56842;.

These are Unicode UTF-16 surrogate pairs represented in decimal.

How can I convert them to Unicode code points such as "U+1F62C" with the standard library?

Recommended answer

You can easily do it by hand. The algorithm for going from a high Unicode code point to a surrogate pair and back is not that hard. The Wikipedia page on UTF-16 says:

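0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.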

That is just bitwise AND, OR and shift operations, and can trivially be implemented in C or C++.

Since you said you want to use the standard library: what you are asking for is a conversion from two 16-bit UTF-16 surrogates to one 32-bit Unicode code point, so codecvt is your friend, provided you can compile in C++11 mode or higher. Note that the <codecvt> header was deprecated in C++17, so newer compilers may emit deprecation warnings.

Here is an example processing your values on a little-endian architecture:

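#include <codecvt>
#include <cstdint>
#include <cwchar>
#include <iostream>
#include <locale>

int main() {
    // Facet that decodes little-endian UTF-16 bytes into UTF-32 code points.
    std::codecvt_utf16<char32_t, 0x10ffffUL,
        std::codecvt_mode::little_endian> cvt;
    std::mbstate_t state{};  // conversion state must start zero-initialized

    char16_t pair[] = { 55357, 56842 };  // 0xD83D, 0xDE0A
    // The facet reads its input as raw bytes, so view the pair as chars.
    const char *from = reinterpret_cast<const char *>(pair);
    const char *from_next;

    char32_t u[2];
    char32_t *unext;

    // Convert both surrogates (sizeof pair bytes) into one code point in u[0].
    cvt.in(state, from, from + sizeof pair, from_next, u, u + 1, unext);

    std::cout << std::hex << (uint16_t) pair[0] << " " << (uint16_t) pair[1]
        << std::endl;
    std::cout << std::hex << (uint32_t) u[0] << std::endl;

    return 0;
}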

The output is as expected:

d83d de0a
1f60a
