What exactly can wchar_t represent?


Question


According to cppreference.com's doc on wchar_t:

wchar_t - type for wide character representation (see wide strings). Required to be large enough to represent any supported character code point (32 bits on systems that support Unicode. A notable exception is Windows, where wchar_t is 16 bits and holds UTF-16 code units) It has the same size, signedness, and alignment as one of the integer types, but is a distinct type.

The Standard says in [basic.fundamental]/5:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales. Type wchar_t shall have the same size, signedness, and alignment requirements as one of the other integral types, called its underlying type. Types char16_t and char32_t denote distinct types with the same size, signedness, and alignment as uint_least16_t and uint_least32_t, respectively, in <cstdint>, called the underlying types.

So, if I want to deal with Unicode characters, should I use wchar_t?

Equivalently, how do I know if a specific Unicode character is "supported" by wchar_t?

Answer

So, if I want to deal with Unicode characters, should I use wchar_t?

First of all, note that the encoding does not force you to use any particular type to represent a character. You can store Unicode text in char just as well as in wchar_t; you only have to remember how many code units make up one code point. With char holding UTF-8, 1 to 4 chars together form one code point. With wchar_t, one code point takes a single wchar_t where the type is 32 bits wide (UTF-32, e.g. on Linux) or 1 to 2 wchar_ts where it is 16 bits wide (UTF-16 on Windows).
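To make the code-unit counts concrete, here is a minimal sketch. The helper names (utf8_units, utf16_units, wchar_units) are made up for illustration, and the sketch assumes that a wchar_t too small to hold U+10FFFF carries UTF-16, as on Windows. The standard does not guarantee that assumption.

```cpp
#include <cstddef>
#include <cwchar>   // WCHAR_MAX

// Number of char code units UTF-8 needs for one code point
// (assumes cp is a valid Unicode scalar value).
constexpr std::size_t utf8_units(char32_t cp) {
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Number of 16-bit code units UTF-16 needs for one code point.
constexpr std::size_t utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// True if a single wchar_t can hold the largest code point, U+10FFFF.
constexpr bool wchar_holds_any_code_point() {
    return WCHAR_MAX >= 0x10FFFF;
}

// Number of wchar_t units for one code point.
// Assumption: a narrow wchar_t means the wide encoding is UTF-16.
constexpr std::size_t wchar_units(char32_t cp) {
    return wchar_holds_any_code_point() ? 1 : utf16_units(cp);
}
```

For U+1F600, for example, utf8_units gives 4 and utf16_units gives 2, while wchar_units gives 1 on a platform with a 32-bit wchar_t and 2 on one with a 16-bit wchar_t.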

Next, there is no single Unicode encoding. Some encodings use a fixed width per code point (UTF-32), while others (UTF-8 and UTF-16) are variable-length. In UTF-8, for instance, the letter 'a' takes just 1 byte, while every character outside ASCII takes 2 to 4 bytes.

So you have to decide what kind of characters you want to represent and then choose your encoding accordingly, because this affects the number of bytes your data will take. E.g. using UTF-32 for mostly English text produces many zero bytes. UTF-8 is a better choice for many Latin-based languages, while UTF-16 is usually more compact for East Asian languages.
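The size trade-off can be checked at compile time. The sketch below assumes 8-bit bytes, so that char16_t occupies 2 bytes and char32_t occupies 4. The helper payload_bytes is a made-up name, and the Japanese sample is written with universal character names so the result does not depend on the source file's encoding.

```cpp
#include <cstddef>

// Bytes taken by a string literal's characters, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t payload_bytes(const CharT (&)[N]) {
    return (N - 1) * sizeof(CharT);
}

// "hello": five ASCII letters.
constexpr std::size_t en_utf8  = payload_bytes(u8"hello");  // 5 units of 1 byte
constexpr std::size_t en_utf16 = payload_bytes(u"hello");   // 5 units of 2 bytes
constexpr std::size_t en_utf32 = payload_bytes(U"hello");   // 5 units of 4 bytes

// Three kanji (U+65E5 U+672C U+8A9E), all in the Basic Multilingual Plane.
constexpr std::size_t ja_utf8  = payload_bytes(u8"\u65E5\u672C\u8A9E");  // 3 bytes each
constexpr std::size_t ja_utf16 = payload_bytes(u"\u65E5\u672C\u8A9E");   // 2 bytes each
constexpr std::size_t ja_utf32 = payload_bytes(U"\u65E5\u672C\u8A9E");   // 4 bytes each
```

Under those assumptions the English sample takes 5, 10 and 20 bytes in UTF-8, UTF-16 and UTF-32, while the Japanese sample takes 9, 6 and 12.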

Once you have decided on this, you should minimize the number of conversions and stay consistent with your decision.

In the next step, you may decide what data type is appropriate to represent the data (or what kind of conversions you may need).

If you want to manipulate or interpret text on a code-point basis, char certainly is not the way to go if you have e.g. Japanese kanji, since each one spans several chars in UTF-8. But if you just want to pass your data along and treat it as nothing more than an opaque sequence of bytes, you can simply use char.
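The gap between bytes and code points in char-based UTF-8 can be illustrated with a small sketch. The function names are hypothetical, the input is assumed to be valid, null-terminated UTF-8, and the loops require C++14 or later for constexpr.

```cpp
#include <cstddef>

// Counts the bytes of a null-terminated string, like strlen.
constexpr std::size_t count_bytes(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}

// Counts code points in UTF-8 text by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
// Assumption: the input is valid UTF-8; no validation is done here.
constexpr std::size_t count_code_points(const char* s) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the two kanji U+65E5 U+672C, encoded as the bytes E6 97 A5 E6 9C AC, count_bytes returns 6 while count_code_points returns 2. Indexing or truncating such a string by char position can therefore cut a character in half.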

The link to UTF-8 Everywhere was already posted as a comment, and I suggest you have a look at it as well. Another good read is What every programmer should know about encodings.

As of now, C++ has only rudimentary language support for Unicode (such as the char16_t and char32_t data types and the u8/u/U literal prefixes). So choosing a library to manage encodings (especially conversions) is certainly good advice.
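As a rough illustration of what "rudimentary" means in practice: the language will give you UTF-16 code units from a u"" literal, but turning a surrogate pair back into a code point is left to you or to a library. The helpers below are a hand-written sketch with made-up names, not part of any standard API.

```cpp
// A u"" literal holds UTF-16 code units; U+1F600 lies outside the
// Basic Multilingual Plane and is stored as a surrogate pair.
constexpr char16_t smiley_utf16[] = u"\U0001F600";

constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a high/low surrogate pair into the code point it encodes.
// Assumption: the caller has already checked both halves with the predicates above.
constexpr char32_t combine_surrogates(char16_t hi, char16_t lo) {
    return 0x10000
         + ((static_cast<char32_t>(hi) - 0xD800) << 10)
         + (static_cast<char32_t>(lo) - 0xDC00);
}
```

Here smiley_utf16 holds the units 0xD83D and 0xDE00, and combine_surrogates applied to them yields 0x1F600. A dedicated library takes care of this, along with validation and conversion between encodings.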
