Does the C++ standard mandate an encoding for wchar_t?


Question

Here are some excerpts from my copy of the 2014 draft standard, N4140:


22.5 Standard code conversion facets [locale.stdcvt]

3 For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16:
(3.1) — Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.


4 For the facet codecvt_utf8:
(4.1) — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.


One interpretation of these two paragraphs is that wchar_t must be encoded as either UCS2 or UCS4. I don't like it much because if it's true, we have an important property of the language buried deep in a library description. I have tried to find a more direct statement of this property, but to no avail.

Another interpretation is that the wchar_t encoding is not required to be either UCS2 or UCS4, and on implementations where it isn't, codecvt_utf8 won't work for wchar_t. I don't like this interpretation much either, because if it's true, and neither the char nor the wchar_t native encoding is Unicode, there doesn't seem to be a way to portably convert between those native encodings and Unicode.


Which of the two interpretations is true? Is there another one which I overlooked?

Clarification: I'm not asking for general opinions about the suitability of wchar_t for software development, or about properties of wchar_t one can derive from elsewhere. I am interested in these two specific paragraphs of the standard. I'm trying to understand what these specific paragraphs entail or do not entail.

Clarification 2: If 4.1 said "The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 or whatever encoding is imposed on wchar_t by the current global locale" there would be no problem. It doesn't. It says what it says. It appears that if one uses std::codecvt_utf8<wchar_t>, one ends up with a bunch of wchar_t encoded as UCS2 or UCS4, regardless of the current global locale. (There is no way to specify a locale or any character conversion facet for codecvt_utf8.) So the question can be rephrased like this: is the conversion result directly usable with the current global locale (and/or with any possible locale) for output, wctype queries and so on? If not, what is it usable for? (If the second interpretation above is correct, the answer would seem to be "nothing".)

Recommended answer

wchar_t is just an integral type. It has a min value, a max value, etc.

Its size is not fixed by the standard.


If it is large enough, you can store UCS-2 or UCS-4 data in a buffer of wchar_t. This is true regardless of the system you are on, as UCS-2 and UCS-4 and UTF-16 and UTF-32 are just descriptions of integer values arranged in a sequence.
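As a minimal sketch of the "if it is large enough" condition, the following copies code points into a wchar_t buffer only when each value fits in the platform's wchar_t. The names store_code_points and demo_store are hypothetical and used only for illustration; nothing here is mandated by the standard.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical helper: copies code points into a wchar_t buffer, but only if
// every value fits in this platform's wchar_t. Returns false (and leaves the
// buffer empty) as soon as one value does not fit.
bool store_code_points(const std::vector<char32_t>& code_points,
                       std::vector<wchar_t>& out) {
    const char32_t limit =
        static_cast<char32_t>(std::numeric_limits<wchar_t>::max());
    out.clear();
    for (char32_t cp : code_points) {
        if (cp > limit) {
            out.clear();
            return false;
        }
        out.push_back(static_cast<wchar_t>(cp));
    }
    return true;
}

// Usage sketch: these three code points are all below 0x10000, so they fit
// even where wchar_t is only 16 bits wide.
void demo_store() {
    std::vector<wchar_t> buffer;
    const bool ok = store_code_points({0x48, 0xE9, 0x4E2D}, buffer);
    assert(ok && buffer.size() == 3);
}
```

The buffer ends up holding plain integers. Whether they "are" UCS-4 depends entirely on which operation reads them later.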


In C++11, there are std APIs that read or write data presuming it has those encodings. In C++03, there are APIs that read or write data using the current locale.



22.5 Standard code conversion facets [locale.stdcvt]


3 For each of the three code conversion facets codecvt_utf8, codecvt_utf16, and codecvt_utf8_utf16:


(3.1) — Elem is the wide-character type, such as wchar_t, char16_t, or char32_t.


4 For the facet codecvt_utf8:


(4.1) — The facet shall convert between UTF-8 multibyte sequences and UCS2 or UCS4 (depending on the size of Elem) within the program.

So here codecvt_utf8 deals with UTF-8 on one side, and UCS2 or UCS4 (depending on how big Elem is) on the other. It does conversion.


The Elem (the wide character) is presumed to be encoded in UCS2 or UCS4 depending on how big it is.


This does not mean that wchar_t is encoded as such, it just means this operation interprets the wchar_t as being encoded as such.
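To make the "this operation interprets" point concrete, here is a small sketch that runs the facet through std::wstring_convert. The helper names utf8_to_wide, wide_to_utf8 and demo_codecvt_utf8 are hypothetical. Note that both std::wstring_convert and the <codecvt> facets are deprecated since C++17, so a compiler may emit deprecation warnings; this is an illustration under that assumption, not a recommendation.

```cpp
#include <cassert>
#include <codecvt>  // std::codecvt_utf8 (deprecated since C++17)
#include <locale>   // std::wstring_convert (deprecated since C++17)
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into wchar_t elements. The facet
// writes UCS2 or UCS4 values and never consults the global locale.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(utf8);
}

// Hypothetical helper: the reverse direction. The facet reads each wchar_t
// as a UCS2 or UCS4 value, whatever the program meant by it.
std::string wide_to_utf8(const std::wstring& wide) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.to_bytes(wide);
}

// Usage sketch: "\xC3\xA9" is the UTF-8 encoding of U+00E9.
void demo_codecvt_utf8() {
    const std::wstring wide = utf8_to_wide("\xC3\xA9");
    assert(wide.size() == 1);
    assert(wide[0] == static_cast<wchar_t>(0xE9));
    assert(wide_to_utf8(wide) == "\xC3\xA9");
}
```

The resulting wchar_t holds the integer 0xE9. Nothing in this code says what a locale-dependent function would make of that same integer.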

How the UCS2 or UCS4 got into the Elem is not something this part of the standard cares about. Maybe you set it in there with hex constants. Maybe you read it from I/O. Maybe you calculated it on the fly. Maybe you used a high-quality random-number generator. Maybe you added together the bit values of an ASCII string. Maybe you calculated a fixed-point approximation of the log* of the number of seconds it takes the moon to change the Earth's day by 1 second. That is not these paragraphs' problem. These paragraphs simply mandate how bits are modified and interpreted.

Similar claims hold in other cases. This does not mandate what format wchar_t has. It simply states how these facets interpret wchar_t or char16_t or char32_t or char8_t (reading or writing).


Other ways of interacting with wchar_t use different methods to mandate how the value of the wchar_t is interpreted.

iswalpha uses the (global) locale to interpret the wchar_t, for example. In some locales, the wchar_t may be UCS2. In others, it might be some insane Cthulhian encoding whose details enable you to see a new color from out of space.


To be explicit: encodings are not the property of data, or bits. Encodings are properties of interpretation of data. Quite often there is only one proper or reasonable interpretation of data that makes any sense, but the data itself is bits.
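A small illustration of that distinction, using hand-written decoders rather than any standard facility: the same two 16-bit integers are read once as UCS-2 and once as UTF-16. The function names are hypothetical, and the UTF-16 reader is simplified (unpaired surrogates are passed through unchanged).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interpretation 1: every 16-bit value is one UCS-2 code point.
std::vector<char32_t> read_as_ucs2(const std::vector<char16_t>& units) {
    return std::vector<char32_t>(units.begin(), units.end());
}

// Interpretation 2: the values are UTF-16, so a high surrogate followed by a
// low surrogate combines into a single code point above U+FFFF.
std::vector<char32_t> read_as_utf16(const std::vector<char16_t>& units) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        const char32_t unit = units[i];
        const bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i + 1 < units.size()) {
            const char32_t next = units[i + 1];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                out.push_back(0x10000 + ((unit - 0xD800) << 10) +
                              (next - 0xDC00));
                ++i;  // the low surrogate has been consumed
                continue;
            }
        }
        out.push_back(unit);
    }
    return out;
}

// Usage sketch: identical bits, two different readings.
void demo_interpretations() {
    const std::vector<char16_t> units{0xD83D, 0xDE00};
    assert(read_as_ucs2(units).size() == 2);
    assert(read_as_utf16(units).size() == 1);
    assert(read_as_utf16(units)[0] == 0x1F600);
}
```

The input vector never changes. Only the function applied to it decides whether it holds two values or one.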


The C++ standard does not mandate what is stored in a wchar_t. It does mandate what certain operations interpret the contents of a wchar_t to be. That section describes how some facets interpret the data in a wchar_t.
