Standard way in C11 and C++11 to convert UTF-8?
Question
C11 and C++11 both introduce the uchar.h/cuchar header defining char16_t and char32_t as explicitly 16 and 32 bit wide characters, add the literal syntax u"" and U"" for writing strings with these character types, and provide the macros __STDC_UTF_16__ and __STDC_UTF_32__ that tell you whether or not they correspond to UTF-16 and UTF-32 code units. This helps remove the ambiguity about wchar_t, which on some platforms was 16 bits and generally used to hold UTF-16 code units, and on other platforms was 32 bits and generally used to hold UTF-32 code units; assuming those macros are set, you can now write portable, unambiguous code referring to UTF-16 and UTF-32. __STDC_ISO_10646__ can also be used as a proxy to determine whether wchar_t is capable of holding UTF-32 values; if it can't, you can't necessarily assume that it holds UTF-16, but that is probably a close enough approximation to be portable.
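To make the difference between the two literal kinds concrete, here is a minimal sketch. The helper names (utf16_units, utf32_units, has_stdc_utf_16) are made up for illustration; the only things taken from the question are the u"" and U"" prefixes and the __STDC_UTF_16__ macro. It relies on C++11 encoding u"" literals as UTF-16 and U"" literals as UTF-32.

```cpp
#include <cstddef>
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane, so it needs a
// surrogate pair (two code units) in UTF-16 but one code unit in UTF-32.
inline std::size_t utf16_units() {
    return std::u16string(u"\U0001F600").size();
}

inline std::size_t utf32_units() {
    return std::u32string(U"\U0001F600").size();
}

// Reports whether this implementation defines the macro from the question.
// Hypothetical helper: whether a C++ compiler defines it can vary.
inline bool has_stdc_utf_16() {
#ifdef __STDC_UTF_16__
    return true;
#else
    return false;
#endif
}
```

Counting code units this way shows why char16_t strings cannot be indexed by code point without first handling surrogate pairs.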
They also add the functions mbrtoc16, mbrtoc32, c16rtomb, and c32rtomb for converting between multibyte characters and these types. Between these and the existing mbstowcs family of functions, it's possible to translate between UTF-16, UTF-32, the platform multibyte character set, and the platform wide character set portably (though not necessarily losslessly unless the platform defined multibyte and wide character sets are UTFs; in particular, it seems like these functions will be fairly useless on Windows where the locale defined multibyte encoding is not allowed to use more than two bytes per character).
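As a rough illustration of how these functions are driven, the sketch below decodes a multibyte string into char32_t units with mbrtoc32. The function name multibyte_to_utf32 and the choice to throw on bad input are assumptions made for this example. The input is interpreted in whatever encoding the current LC_CTYPE locale defines, which is exactly the limitation discussed in this question.

```cpp
#include <cstddef>
#include <cstring>
#include <cuchar>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a NUL-terminated multibyte string (in the
// current locale's encoding) into char32_t code units.
std::u32string multibyte_to_utf32(const char* s) {
    std::u32string out;
    std::mbstate_t state{};                 // initial conversion state
    std::size_t remaining = std::strlen(s);
    while (remaining > 0) {
        char32_t c = 0;
        std::size_t n = std::mbrtoc32(&c, s, remaining, &state);
        if (n == static_cast<std::size_t>(-1) ||
            n == static_cast<std::size_t>(-2)) {
            // -1: invalid sequence, -2: sequence cut off before its end
            throw std::runtime_error("invalid or truncated multibyte input");
        }
        if (n == 0) break;                  // decoded the null character
        out.push_back(c);
        s += n;                             // n bytes were consumed
        remaining -= n;
    }
    return out;
}
```

In the default "C" locale this is only reliable for plain ASCII; anything beyond that depends on the locale that has been selected with setlocale.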
Furthermore, they added the u8"" syntax for writing literal UTF-8 encoded strings. As UTF-8 is an encoding that is compatible with most functions that deal in char * and std::string, this is one of the most useful new additions.
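A small sketch of what a u8"" literal contains, using U+20AC (EURO SIGN), which UTF-8 encodes as the three bytes E2 82 AC. The constant names are invented for this example. It uses sizeof and an explicit cast so that it compiles both in C++11 through C++17, where the literal's element type is char, and in C++20, where it became char8_t.

```cpp
#include <cstddef>

// Number of UTF-8 code units (bytes) in each literal, excluding the
// terminating NUL that sizeof also counts.
constexpr std::size_t euro_bytes  = sizeof(u8"\u20AC") - 1;
constexpr std::size_t ascii_bytes = sizeof(u8"A") - 1;

// First byte of the encoded euro sign; the cast works for char and char8_t.
constexpr unsigned char euro_lead =
    static_cast<unsigned char>(u8"\u20AC"[0]);
```

Because ASCII characters stay one byte each, byte-oriented functions keep working on such strings as long as they do not split a multi-byte sequence.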
However, they seem to have failed to add any way to portably convert between UTF-8, UTF-16, and UTF-32. The mbrtoc16 and related functions convert between the implementation defined multibyte encoding and UTF-16 or UTF-32, but you can't depend on that encoding being UTF-8. On Unix-like platforms it depends on the locale; many of them use UTF-8 in their locale by default, and even where it's not the default you can at least set the locale to a UTF-8 locale for the purposes of knowing that "multibyte" means UTF-8. On Windows, however, you explicitly can't use UTF-8 or any other encoding that requires more than two bytes for the locale.
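One possible way to act on the "set the locale to a UTF-8 locale" idea is sketched below. It is a heuristic, not a standard facility: the function names are hypothetical, and the locale names in the candidate list are assumptions that exist on many Unix-like systems but are not guaranteed anywhere. The probe decodes the UTF-8 bytes of U+20AC under the current locale and checks that exactly that code point comes back.

```cpp
#include <clocale>
#include <cstddef>
#include <cuchar>
#include <cwchar>

// Heuristic probe: does the current LC_CTYPE locale decode the UTF-8
// encoding of U+20AC (E2 82 AC) as that single code point?
bool multibyte_looks_like_utf8() {
    const char euro[] = "\xE2\x82\xAC";
    std::mbstate_t state{};
    char32_t c = 0;
    std::size_t n = std::mbrtoc32(&c, euro, 3, &state);
    return n == 3 && c == 0x20AC;
}

// Try a few commonly installed UTF-8 locale names (assumed, not guaranteed).
// Returns true if one was accepted and passes the probe; LC_CTYPE is left
// set to that locale in this case.
bool try_set_utf8_locale() {
    const char* candidates[] = {"C.UTF-8", "en_US.UTF-8", "en_US.utf8"};
    for (const char* name : candidates) {
        if (std::setlocale(LC_CTYPE, name) != nullptr &&
            multibyte_looks_like_utf8()) {
            return true;
        }
    }
    return false;
}
```

Note that setlocale changes process-wide state, so a library generally cannot do this on behalf of its caller, which is part of why this route is unsatisfying.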
Am I just missing something, or is the UTF-8 string type not accompanied by any way to convert it to the other types of strings: platform defined multibyte, platform defined wide char, UTF-16, or UTF-32? Is there no way to even tell if your system multibyte encoding is UTF-8? Is there any reason why this support wasn't included (specifically, I'm looking for actually written justification or discussion by the C or C++ standards committees, not just speculation)? Is there any work being done to improve this situation; is it likely to improve in the future?
Or, is the current best solution, if you want to support UTF-8 in a portable fashion, to write your own implementation, pull in a library dependency, or use platform-specific functions like iconv and MultiByteToWideChar?
Recommended answer
Sounds like you're looking for the std::codecvt type. See the example on that page for usage.
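The referenced page is not reproduced here, so the following is only a sketch of one way the codecvt machinery can be used, not necessarily the example the answer points to. It assumes the C++11 std::wstring_convert wrapper together with the std::codecvt_utf8 facet from the codecvt header; both were deprecated in C++17, so newer compilers may warn. The function names are made up for illustration.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 bytes -> UTF-32 code points, independent of the global locale.
std::u32string utf8_to_utf32(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);   // throws std::range_error on bad input
}

// UTF-32 code points -> UTF-8 bytes.
std::string utf32_to_utf8(const std::u32string& utf32) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(utf32);
}
```

Unlike mbrtoc32, these conversions do not consult the locale's multibyte encoding, which is the property the question is asking for.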