Size of wchar_t* for surrogate pair (Unicode character out of BMP) on Windows


Question

I have run into an interesting issue on Windows 8. I wanted to check that I can represent Unicode characters outside the BMP with wchar_t* strings. The following test code produced results I did not expect:

const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character

int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.

int i2 = sizeof(s1); // i2 == 4, because of the terminating '\0' (I guess).
int i3 = sizeof(s2); // i3 == 4, why?

U+2008A is a Han character that lies outside the Basic Multilingual Plane, so it should be represented by a surrogate pair in UTF-16. That means, if I understand it correctly, that it should take two wchar_t code units. So I expected sizeof(s2) to be 6 (4 for the two wchar_t-s of the surrogate pair and 2 for the terminating \0).

So why is sizeof(s2) == 4? I verified that the s2 string is constructed correctly: I rendered it with DirectWrite, and the Han character was displayed correctly.

UPDATE: As Naveen pointed out, I was determining the size of the arrays incorrectly. The following code produces the correct result:

const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character

int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.

std::wstring str1 (s1);
std::wstring str2 (s2);

int i2 = str1.size(); // i2 == 1.
int i3 = str2.size(); // i3 == 2, because two wchar_t code units are needed for the surrogate pair.

Recommended answer

sizeof(s2) returns the number of bytes required to store the pointer s2 itself (the same as for any other pointer of that type), which is 4 bytes on your system. It has nothing to do with the character(s) stored in the string that s2 points to.
