Conflict: definition of a wchar_t string in the C++ standard and the Windows implementation?


Question




From C++03, section 2.13:

A wide string literal has type "array of n const wchar_t" and has static storage duration, where n is the size of the string as defined below.

The size of a wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating L'\0'.

From C++0x, section 2.14.5:

A wide string literal has type "array of n const wchar_t", where n is the size of the string as defined below.

The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.

The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'.
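To make the two counting rules concrete, here is a minimal sketch that checks them at compile time. It assumes a C++11 (or later) compiler; `literal_size` is a hypothetical helper introduced only for this illustration, not something from the standard.

```cpp
#include <cassert>   // only needed for the runtime checks mentioned below
#include <cstddef>

// Hypothetical helper: number of array elements in a string literal,
// terminator included.
template <typename CharT, std::size_t N>
constexpr std::size_t literal_size(const CharT (&)[N]) {
    return N;
}

// U+E0005 lies outside the BMP.
// char32_t literal: one element for the character, one for U'\0'.
static_assert(literal_size(U"\U000E0005") == 2,
              "char32_t: character + terminator");

// char16_t literal: the character requires a surrogate pair, so it adds
// one extra element on top of the character itself and the u'\0'.
static_assert(literal_size(u"\U000E0005") == 3,
              "char16_t: surrogate pair + terminator");

// A BMP character such as U+00E9 takes one element in both forms.
static_assert(literal_size(u"\u00E9") == 2,
              "char16_t: BMP character + terminator");
static_assert(literal_size(U"\u00E9") == 2,
              "char32_t: BMP character + terminator");
```

The same helper can be used at run time, for example `assert(literal_size(u"\U000E0005") == 3);`. Wide literals are deliberately left out here because their element count is exactly what the question is about.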

The C++03 wording is quite vague. In C++0x, however, the size of a wide string literal is counted the same way as for a char32_t literal, and differently from a char16_t literal.

There's a post that states clearly how Windows implements wchar_t: http://stackoverflow.com/questions/402283?tab=votes%23tab-top

In short, wchar_t on Windows is 16 bits wide and encoded as UTF-16. The wording in the standard apparently conflicts with that.

For example,

wchar_t kk[] = L"\U000E0005";

This code point does not fit in 16 bits; UTF-16 needs two 16-bit code units (a surrogate pair) to encode it.
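As a sketch of why two units are needed, the standard UTF-16 arithmetic for a code point above U+FFFF can be written out as follows. The names `SurrogatePair`, `utf16_units`, and `to_surrogates` are hypothetical and exist only for this example; `to_surrogates` assumes its argument is in the range U+10000 to U+10FFFF.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type holding the two UTF-16 code units of a surrogate pair.
struct SurrogatePair {
    std::uint16_t high;
    std::uint16_t low;
};

// Number of UTF-16 code units needed for a code point.
constexpr int utf16_units(std::uint32_t cp) {
    return cp >= 0x10000 ? 2 : 1;
}

// Splits a supplementary code point (U+10000..U+10FFFF) into its surrogates:
// subtract 0x10000, put the top 10 bits in the high surrogate and the
// bottom 10 bits in the low surrogate.
constexpr SurrogatePair to_surrogates(std::uint32_t cp) {
    return SurrogatePair{
        static_cast<std::uint16_t>(0xD800 + ((cp - 0x10000) >> 10)),
        static_cast<std::uint16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))};
}
```

For U+E0005 this gives the pair 0xDB40, 0xDC05, which is what a UTF-16 wchar_t array would have to store before the terminator.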

However, according to the standard, kk is an array of 2 wchar_t (1 for the universal-character-name \U000E0005, 1 for \0).

But in actual storage, Windows needs 3 16-bit wchar_t objects: 2 for the surrogate pair and 1 for the \0. So, going by how the array is actually laid out, kk is an array of 3 wchar_t.
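One way to see which of the two counts a given compiler picks is to ask it directly. The sketch below uses the declaration from the question; `kk_elements`, `wchar_bits`, and `describe_kk` are hypothetical names added for illustration. No fixed output is claimed, since the result is implementation-dependent: a compiler with a 32-bit wchar_t would be expected to report 2 elements, while a compiler with a 16-bit UTF-16 wchar_t, as described above for Windows, would be expected to report 3.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <string>

// The declaration from the question.
wchar_t kk[] = L"\U000E0005";

// Element count the compiler actually chose for kk, terminator included.
constexpr std::size_t kk_elements = sizeof(kk) / sizeof(kk[0]);

// Width of wchar_t on this compiler, in bits.
constexpr std::size_t wchar_bits = sizeof(wchar_t) * CHAR_BIT;

// Hypothetical helper: summarizes what this compiler did with kk.
std::string describe_kk() {
    return std::to_string(kk_elements) + " elements of " +
           std::to_string(wchar_bits) + " bits each";
}
```

Printing `describe_kk()` on each toolchain of interest shows whether the literal was sized by the standard's counting rule or by the number of UTF-16 code units.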

These two results apparently conflict with each other.

I think the simplest solution for Windows would be to "ban" anything that requires a surrogate pair in wchar_t (that is, "ban" any Unicode character outside the BMP).

Is there anything wrong with my understanding?

Thanks.

Answer

The standard requires that wchar_t be large enough to hold any character in the supported character set. Based on this, I think your premise is correct -- it is wrong for VC++ to represent the single character \U000E0005 using two wchar_t units.

Characters outside the BMP are rarely used, and Windows itself internally uses UTF-16 encoding, so it is simply convenient (even if incorrect) for VC++ to behave this way. However, rather than "banning" such characters, it is likely that the size of wchar_t will increase in the future while char16_t takes its place in the Windows API.

The answer you linked to is somewhat misleading as well:

"On Linux, a wchar_t is 4-bytes, while on Windows, it's 2-bytes"

The size of wchar_t depends solely on the compiler and has nothing to do with the operating system. It just happens that VC++ uses 2 bytes for wchar_t, but once again, this could very well change in the future.
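Since the size is a property of the compiler, it can be checked in code rather than assumed from the operating system. This is a minimal sketch; `wchar_width_bits` and `wchar_holds_any_code_point` are hypothetical names, and the check assumes Unicode's upper limit of U+10FFFF.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cwchar>   // WCHAR_MAX

// Width of wchar_t on the compiler being used, in bits.
constexpr std::size_t wchar_width_bits = sizeof(wchar_t) * CHAR_BIT;

// True when a single wchar_t can represent every Unicode code point,
// that is, when no surrogate pairs are needed in wide strings.
constexpr bool wchar_holds_any_code_point =
    static_cast<unsigned long long>(WCHAR_MAX) >= 0x10FFFFULL;
```

The same operating system can give different answers here. GCC, for instance, has a `-fshort-wchar` option that switches wchar_t to a 16-bit type, which supports the point that the compiler and its settings decide the size.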
