Conflicts: definition of wchar_t string in C++ standard and Windows implementation?

Views: 185 | Published: 2016/10/23 23:10:51 | Tags: c++ unicode c++11 string-literals wchar-t


Question

From C++2003, section 2.13:

A wide string literal has type "array of n const wchar_t" and has static storage duration, where n is the size of the string as defined below.

The size of a wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating L'\0'.

From C++0x, section 2.14.5:

A wide string literal has type "array of n const wchar_t", where n is the size of the string as defined below.

The size of a char32_t or wide string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for the terminating U'\0' or L'\0'.

The size of a char16_t string literal is the total number of escape sequences, universal-character-names, and other characters, plus one for each character requiring a surrogate pair, plus one for the terminating u'\0'.

The wording in C++2003 is quite vague. In C++0x, however, the size of a wide (wchar_t) string literal is counted the same way as for a char32_t literal and differently from a char16_t literal: no extra element is added for a character that requires a surrogate pair.
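As a rough illustration of the counting rule quoted above, the sketch below compares the array sizes of a char16_t and a char32_t literal holding the same character. The helper name element_count is made up for this example, and the code assumes a C++11 compiler.

```cpp
#include <cstddef>

// Hypothetical helper: number of elements in a string literal's array type,
// including the terminating null character.
template <typename CharT, std::size_t N>
constexpr std::size_t element_count(const CharT (&)[N]) { return N; }

// char16_t literal: 1 universal-character-name + 1 extra element because
// U+E0005 needs a surrogate pair + 1 terminating u'\0'.
constexpr std::size_t kChar16Count = element_count(u"\U000E0005");

// char32_t literal: 1 universal-character-name + 1 terminating U'\0'.
constexpr std::size_t kChar32Count = element_count(U"\U000E0005");

static_assert(kChar16Count == 3, "char16_t literal counts the surrogate pair");
static_assert(kChar32Count == 2, "char32_t literal counts one element per character");
```

The wide-literal rule is worded like the char32_t case, which is where the question about 16-bit wchar_t comes from.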

There's a post that states clearly how Windows implements wchar_t: http://stackoverflow.com/questions/402283?tab=votes%23tab-top

In short, wchar_t on Windows is 16 bits wide and encoded using UTF-16. The wording in the standard therefore appears to conflict with the Windows implementation.

For example,

wchar_t kk[] = L"\U000E0005";

This code point does not fit in 16 bits, so UTF-16 needs two 16-bit code units (a surrogate pair) to encode it.
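To make the surrogate pair concrete, here is a minimal sketch of the standard UTF-16 encoding arithmetic for a code point above U+FFFF. The function name to_surrogate_pair is hypothetical, and the code assumes the input is in the range U+10000 to U+10FFFF.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: split a code point in [U+10000, U+10FFFF] into a
// UTF-16 surrogate pair (high surrogate first, low surrogate second).
// No range checking is done; the caller is assumed to pass a valid value.
std::pair<std::uint16_t, std::uint16_t> to_surrogate_pair(std::uint32_t code_point) {
    const std::uint32_t v = code_point - 0x10000;  // 20-bit value
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));    // top 10 bits
    const std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));  // bottom 10 bits
    return std::make_pair(high, low);
}
```

For U+E0005 this yields the pair 0xDB40, 0xDC05, so a UTF-16 encoding of the literal needs two code units plus the terminator.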

However, according to the standard, kk is an array of 2 wchar_t (1 for the universal-character-name \U000E0005, 1 for the \0).

But in the internal storage, Windows needs 3 16-bit wchar_t objects to store it: 2 wchar_t for the surrogate pair and 1 wchar_t for the \0. Therefore, by the definition of the array, kk is an array of 3 wchar_t.

These two results clearly conflict with each other.
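The discrepancy can be observed directly by asking the compiler how many elements kk has. This is only a sketch: the function name wide_literal_elements is made up here, and the result depends on the compiler's wchar_t. It is expected to be 3 where wchar_t is 16 bits and the literal is encoded as UTF-16 (as described for VC++ above), and 2 where wchar_t is 32 bits.

```cpp
#include <cstddef>

// Hypothetical helper: element count of the array from the question,
// including the terminating L'\0'. The value is implementation-dependent.
std::size_t wide_literal_elements() {
    wchar_t kk[] = L"\U000E0005";
    return sizeof(kk) / sizeof(kk[0]);
}
```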

I think the simplest solution for Windows would be to "ban" anything that requires a surrogate pair in wchar_t (that is, "ban" any Unicode character outside the BMP).

Is there anything wrong with my understanding?

Thanks.

Solution

The standard requires that wchar_t be large enough to hold any character in the supported character set. Based on this, I think your premise is correct -- it is wrong for VC++ to represent the single character \U000E0005 using two wchar_t units.

Characters outside the BMP are rarely used, and Windows itself internally uses UTF-16 encoding, so it is simply convenient (even if incorrect) for VC++ to behave this way. However, rather than "banning" such characters, it is likely that the size of wchar_t will increase in the future while char16_t takes its place in the Windows API.

The answer you linked to is somewhat misleading as well:

On Linux, a wchar_t is 4-bytes, while on Windows, it's 2-bytes

The size of wchar_t depends solely on the compiler and has nothing to do with the operating system. It just happens that VC++ uses 2 bytes for wchar_t, but once again, this could very well change in the future.
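Since the width is a property of the compiler, it can be checked in code rather than assumed from the operating system. The following is a small sketch with made-up function names (wchar_bits, wchar_holds_any_code_point); it treats U+10FFFF, the largest Unicode code point, as the bound that wchar_t would need to cover.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Hypothetical helper: width of wchar_t in bits for the current compiler.
std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Hypothetical helper: true if a single wchar_t can represent every
// Unicode code point up to U+10FFFF.
bool wchar_holds_any_code_point() {
    return static_cast<unsigned long long>(std::numeric_limits<wchar_t>::max())
           >= 0x10FFFFULL;
}
```

With a 2-byte wchar_t the second function returns false, which matches the surrogate-pair behaviour discussed in the question.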
