What's "wrong" with C++ wchar_t and wstrings? What are some alternatives to wide characters?


Question


I have seen a lot of people in the C++ community(particularly ##c++ on freenode) resent the use of wstrings and wchar_t, and their use in the windows api. What is exactly "wrong" with wchar_t and wstring, and if I want to support internationalization, what are some alternatives to wide characters?

Answer

What is wchar_t?

wchar_t is defined such that any locale's char encoding can be converted to a wchar_t representation where every wchar_t represents exactly one codepoint:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1).

                                                                               — C++ [basic.fundamental] 3.9.1/5

This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales, which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale. [1]

Since using wchar_t as a common representation between all locales seems to be its primary use in practice, you might wonder what it's good for if not that.

The original intent and purpose of wchar_t was to make text processing simple by defining it such that it requires a one-to-one mapping from a string's code units to the text's characters, thus allowing the same simple algorithms that are used with ASCII strings to work with other languages.

Unfortunately, the wording of wchar_t's specification assumes a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption [2], so you can't safely use wchar_t for simple text algorithms either.

This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.
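To make the broken one-to-one assumption concrete, here is a minimal sketch that is not part of the original answer. The helper names (precomposed_e_acute, decomposed_e_acute, naive_length, check_combining_sequences) are hypothetical, and char32_t is used only so that one element is exactly one code point regardless of platform.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers: the same user-perceived character spelled two ways.
inline std::u32string precomposed_e_acute() { return U"\u00E9"; }   // U+00E9
inline std::u32string decomposed_e_acute()  { return U"e\u0301"; }  // U+0065 U+0301

// What a one-unit-per-character algorithm would report as the "length".
inline std::size_t naive_length(const std::u32string& s) { return s.size(); }

// Exercises the helpers; call it from main() to run the checks.
inline void check_combining_sequences() {
    assert(naive_length(precomposed_e_acute()) == 1);
    assert(naive_length(decomposed_e_acute()) == 2);
    // Canonically equivalent text, yet a code-point-wise comparison differs.
    assert(precomposed_e_acute() != decomposed_e_acute());
}
```

Even with a fixed-width encoding of code points, length, comparison, and indexing do not line up with what a reader would call a character.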

What use is wchar_t today?

Not much, for portable code anyway. If __STDC_ISO_10646__ is defined then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However, you can't rely on that macro alone to decide whether you can use wchar_t this way because, while most Unix platforms define it, Windows does not, even though Windows uses the same wchar_t encoding in all locales.

The reason Windows doesn't define __STDC_ISO_10646__ is that Windows uses UTF-16 as its wchar_t encoding. UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, so it doesn't satisfy the requirements for __STDC_ISO_10646__.
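Whether a given toolchain makes that promise can be checked at compile time. The following is a rough sketch, not something from the original answer; the function names are hypothetical, and the 21-bit expectation in the last assertion is an assumption about current platforms rather than a quoted requirement.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical helper: true when the implementation defines
// __STDC_ISO_10646__, i.e. promises that wchar_t values are Unicode
// code points in every locale.
constexpr bool wchar_holds_unicode_code_points() {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}

// Width of wchar_t in bits on this platform.
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// Exercises the helpers; call it from main() to run the checks.
inline void check_wchar_properties() {
    assert(wchar_bits() >= 8);
    // Assumption: one unit per code point up to U+10FFFF needs 21 bits.
    assert(!wchar_holds_unicode_code_points() || wchar_bits() >= 21);
}
```

Code that wants to treat wchar_t as a code point could branch on wchar_holds_unicode_code_points() and fall back to explicit conversions elsewhere.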

For platform-specific code, wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').

In hindsight, wchar_t is clearly not useful for simplifying text handling, or as storage for locale-independent text. Portable code should not attempt to use it for these purposes. Non-portable code may find it useful simply because some API requires it.

Alternatives

The alternative I like is to use UTF-8 encoded C strings, even on platforms not particularly friendly toward UTF-8.

This way one can write portable code using a common text representation across platforms, use standard datatypes for their intended purpose, get the language's support for those types (e.g. string literals, though some tricks are necessary to make it work for some compilers), some standard library support, debugger support (more tricks may be necessary), etc. With wide characters it's generally harder or impossible to get all of this, and you may get different pieces on different platforms.

One thing UTF-8 does not provide is the ability to use simple text algorithms such as are possible with ASCII. In this respect UTF-8 is no worse than any other Unicode encoding. In fact it may be considered better: multi-code-unit representations are more common in UTF-8, so bugs in code handling such variable-width representations of characters are more likely to be noticed and fixed than if you try to stick to UTF-32 with NFC or NFKC.
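As an illustration of why ASCII-style algorithms stop working, the sketch below (not from the original answer) compares the byte length of a UTF-8 string with its code point count. The function names are hypothetical, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string assumed to hold
// well-formed UTF-8 by skipping continuation bytes (bit pattern 10xxxxxx).
inline std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Exercises the helper; call it from main() to run the checks.
inline void check_utf8_lengths() {
    // "naive" with U+00EF in place of 'i'; the literal is split so the
    // hex escapes cannot absorb the letters that follow them.
    const std::string word = "na" "\xC3\xAF" "ve";
    assert(word.size() == 6);              // bytes (code units)
    assert(utf8_code_points(word) == 5);   // code points
}
```

The mismatch shows up with everyday text, which is the point made above: a bug that assumes one byte per character tends to surface quickly.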

Many platforms use UTF-8 as their native char encoding and many programs do not require any significant text processing, so writing an internationalized program on those platforms is little different from writing code without considering internationalization. Writing more widely portable code, or writing on other platforms, requires inserting conversions at the boundaries of APIs that use other encodings.
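A boundary conversion of that kind might look like the following sketch, which is not from the original answer. The names utf8_to_utf16 and check_utf8_to_utf16 are hypothetical, and the decoder is deliberately minimal: malformed bytes become U+FFFD, and overlong forms and encoded surrogates are not rejected. Production code would more likely call a platform or library routine.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: converts UTF-8 to UTF-16 for hand-off to an API
// that expects 16-bit code units. Sketch only; see the caveats above.
inline std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // continuation bytes still expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; }  // stray byte
        ++i;
        for (; extra > 0; --extra) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                cp = 0xFFFD;  // truncated or broken sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        if (cp > 0x10FFFF) cp = 0xFFFD;  // outside the Unicode range
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Exercises the helper; call it from main() to run the checks.
inline void check_utf8_to_utf16() {
    const std::u16string e_acute = utf8_to_utf16("\xC3\xA9");        // U+00E9
    assert(e_acute.size() == 1 && e_acute[0] == 0x00E9);
    const std::u16string emoji = utf8_to_utf16("\xF0\x9F\x98\x80");  // U+1F600
    assert(emoji.size() == 2 && emoji[0] == 0xD83D && emoji[1] == 0xDE00);
}
```

The idea is to keep UTF-8 everywhere inside the program and convert only at the call site of an API that needs another encoding.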

Another alternative used by some software is to choose a cross-platform representation, such as unsigned short arrays holding UTF-16 data, and then to supply all the library support and simply live with the costs in language support, etc.

C++11 adds new kinds of wide characters as alternatives to wchar_t: char16_t and char32_t, with attendant language/library features. These aren't actually guaranteed to be UTF-16 and UTF-32, but I don't imagine any major implementation will use anything else. C++11 also improves UTF-8 support, for example with UTF-8 string literals, so it won't be necessary to trick VC++ into producing UTF-8 encoded strings (although I may continue to do so rather than use the u8 prefix).
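The sketch below, which is not from the original answer, shows how the same character comes out under the C++11 literal prefixes. The helper names are hypothetical. Only sizeof is applied to the u8 literal because its element type changed from char to char8_t in C++20, so that expression compiles either way.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers: U+1F600 lies outside the Basic Multilingual Plane.
inline std::u16string emoji_utf16() { return u"\U0001F600"; }
inline std::u32string emoji_utf32() { return U"\U0001F600"; }

// Bytes needed for U+00E9 in a u8 literal; sizeof also counts the
// terminating null, hence the subtraction.
constexpr std::size_t e_acute_utf8_bytes() { return sizeof(u8"\u00E9") - 1; }

// Exercises the helpers; call it from main() to run the checks.
inline void check_unicode_literals() {
    assert(emoji_utf16().size() == 2);   // a surrogate pair
    assert(emoji_utf32().size() == 1);   // a single code unit
    assert(e_acute_utf8_bytes() == 2);   // two UTF-8 bytes
}
```

None of the three forms restores the one-unit-per-character property: char16_t needs surrogate pairs, and even char32_t text can contain combining sequences.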

Alternatives to avoid

TCHAR: TCHAR is for migrating ancient Windows programs that assume legacy encodings from char to wchar_t, and is best forgotten unless your program was written in some previous millennium. It's not portable and is inherently unspecific about its encoding and even its data type, making it unusable with any non-TCHAR based API. Since its purpose is migration to wchar_t, which we've seen above isn't a good idea, there is no value whatsoever in using TCHAR.


1. Characters which are representable in wchar_t strings but which are not supported in any locale are not required to be represented with a single wchar_t value. This means that wchar_t could use a variable-width encoding for certain characters, another clear violation of the intent of wchar_t. It's arguable, though, that a character being representable by wchar_t is enough to say that the locale 'supports' that character, in which case variable-width encodings aren't legal and Windows' use of UTF-16 is non-conformant.

2. Unicode allows many characters to be represented with multiple code points, which creates the same problems for simple text algorithms as variable width encodings. Even if one strictly maintains a composed normalization, some characters still require multiple code points. See: http://www.unicode.org/standard/where/
