Standard way in C11 and C++11 to convert UTF-8?

Question

C11 and C++11 both introduce the uchar.h/cuchar header, which defines char16_t and char32_t as explicitly 16- and 32-bit wide characters. They also add the literal syntax u"" and U"" for writing strings with these character types, along with the macros __STDC_UTF_16__ and __STDC_UTF_32__ that tell you whether or not they correspond to UTF-16 and UTF-32 code units. This helps remove the ambiguity about wchar_t, which on some platforms is 16 bits wide and generally used to hold UTF-16 code units, and on others is 32 bits wide and generally used to hold UTF-32 code units; assuming those macros are set, you can now write portable, unambiguous code referring to UTF-16 and UTF-32. __STDC_ISO_10646__ can also be used as a proxy to determine whether wchar_t is capable of holding UTF-32 values; if it can't, you can't necessarily assume that it holds UTF-16, but that is probably a close enough approximation to be portable.
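To make the literal syntax and the macro check concrete, here is a minimal sketch. The function names are made up for illustration. The code point U+1F600 is chosen only because it lies outside the Basic Multilingual Plane, so the UTF-16 and UTF-32 forms differ in length. Whether a C++ compiler defines __STDC_UTF_16__ at all is left as an open question here; the sketch only reports what it finds.

```cpp
#include <string>

// u"" yields char16_t code units, U"" yields char32_t code units.
// U+1F600 needs a surrogate pair in UTF-16 but a single unit in UTF-32.
std::u16string utf16_sample() { return u"\U0001F600"; }
std::u32string utf32_sample() { return U"\U0001F600"; }

// Reports whether the implementation defines __STDC_UTF_16__.
// (Hypothetical helper; an absent macro does not by itself prove that
// char16_t holds something other than UTF-16.)
constexpr bool has_stdc_utf16_macro() {
#ifdef __STDC_UTF_16__
    return true;
#else
    return false;
#endif
}
```

With UTF-16 literals, utf16_sample() holds the two code units 0xD83D 0xDE00, while utf32_sample() holds the single value 0x1F600.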

They also add the functions mbrtoc16, mbrtoc32, c16rtomb, and c32rtomb for converting between multibyte characters and these types. Between these and the existing mbstowcs family of functions, it's possible to translate between UTF-16, UTF-32, the platform multibyte character set, and the platform wide character set portably (though not necessarily losslessly unless the platform defined multibyte and wide character sets are UTFs; in particular, it seems like these functions will be fairly useless on Windows where the locale defined multibyte encoding is not allowed to use more than two bytes per character).
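As a rough illustration of how mbrtoc32 is driven, the sketch below decodes a string in whatever multibyte encoding the current locale uses. The function name multibyte_to_utf32 is hypothetical, and the result is only UTF-32 if the implementation's char32_t actually holds UTF-32 values.

```cpp
#include <cstddef>
#include <cuchar>
#include <cwchar>
#include <string>

// Decodes `in` (current locale's multibyte encoding) into char32_t units.
// Returns false on an invalid or truncated multibyte sequence.
bool multibyte_to_utf32(const std::string& in, std::u32string& out) {
    std::mbstate_t state{};
    const char* p = in.data();
    const char* end = p + in.size();
    out.clear();
    while (p < end) {
        char32_t c = 0;
        std::size_t n = std::mbrtoc32(&c, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2)) {
            return false;  // encoding error or incomplete sequence
        }
        if (n == static_cast<std::size_t>(-3)) {
            out.push_back(c);  // extra output unit, no input consumed
            continue;
        }
        if (n == 0) n = 1;  // an embedded null character was converted
        out.push_back(c);
        p += n;
    }
    return true;
}
```

In the default "C" locale this is only reliable for ASCII input, which is exactly the portability problem the question is about.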

Furthermore, they added the u8"" syntax for writing literal UTF-8 encoded strings. As UTF-8 is an encoding that is compatible with most functions that deal in char * and std::string, this is one of the most useful new additions.
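A small sketch of what a u8"" literal contains. The helper utf8_bytes is hypothetical. It is written as a template because the element type of u8"" is char in C++11 through C++17 but char8_t from C++20 on, and the sketch is meant to compile either way.

```cpp
#include <cstddef>
#include <string>

// Copies a u8"" literal into a std::string, byte for byte.
// Accepts const char[N] (C++11 to C++17) and const char8_t[N] (C++20).
template <typename Char, std::size_t N>
std::string utf8_bytes(const Char (&literal)[N]) {
    std::string out;
    for (std::size_t i = 0; i + 1 < N; ++i) {  // skip the trailing null
        out.push_back(static_cast<char>(literal[i]));
    }
    return out;
}
```

For example, utf8_bytes(u8"\u00e9") gives the two bytes 0xC3 0xA9, regardless of the source or execution character set.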

However, they seem to have failed to add any way to portably convert between UTF-8, UTF-16, and UTF-32. The mbrtoc16 and related functions convert between the implementation-defined multibyte encoding and UTF-16 or UTF-32, but you can't depend on that encoding being UTF-8. On Unix-like platforms it depends on the locale. Many of them use a UTF-8 locale by default, and even where it's not the default you can at least set the locale to a UTF-8 one so that you know "multibyte" means UTF-8. On Windows, however, you explicitly can't use UTF-8, or any other encoding that requires more than two bytes per character, for the locale.
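The "set the locale to a UTF-8 locale" workaround might look roughly like this on a Unix-like system. The locale names in the candidate list are assumptions (they vary between systems and none is guaranteed to be installed), and both function names are made up for this sketch.

```cpp
#include <clocale>
#include <cstddef>
#include <cstring>
#include <cuchar>
#include <cwchar>

// Tries a few commonly seen UTF-8 locale names (assumed, not guaranteed).
// Returns the accepted name, or nullptr if none could be selected.
const char* select_utf8_locale() {
    static const char* const candidates[] = {"C.UTF-8", "en_US.UTF-8", "en_US.utf8"};
    for (const char* name : candidates) {
        if (std::setlocale(LC_CTYPE, name) != nullptr) return name;
    }
    return nullptr;
}

// Decodes the first character of `s` using the current locale.
// Returns 0xFFFFFFFF (not a valid code point) on failure.
char32_t first_code_point(const char* s) {
    std::mbstate_t state{};
    char32_t c = 0;
    std::size_t n = std::mbrtoc32(&c, s, std::strlen(s), &state);
    if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2)) {
        return 0xFFFFFFFF;
    }
    return c;
}
```

If select_utf8_locale() succeeds, first_code_point("\xC3\xA9") should decode the UTF-8 bytes to U+00E9. If it returns nullptr, the standard functions offer no fallback.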

Am I just missing something, or is the UTF-8 string type not accompanied by any way to convert it to the other types of strings: platform defined multibyte, platform defined wide char, UTF-16, or UTF-32? Is there no way to even tell if your system multibyte encoding is UTF-8? Is there any reason why this support wasn't included (specifically, I'm looking for actually written justification or discussion by the C or C++ standards committees, not just speculation)? Is there any work being done to improve this situation; is it likely to improve in the future?

Or, is the current best solution, if you want to support UTF-8 in a portable fashion, to write your own implementation, pull in a library dependency, or use platform-specific functions like iconv and MultiByteToWideChar?
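For a sense of what "write your own implementation" involves, here is a minimal hand-rolled UTF-8 to UTF-32 decoder. It is a sketch, not a vetted library: the name utf8_to_utf32 is made up, and it simply rejects malformed input instead of substituting U+FFFD.

```cpp
#include <cstddef>
#include <string>

// Decodes UTF-8 bytes into UTF-32 code points.
// Returns false on malformed input: bad lead or continuation bytes,
// truncated sequences, overlong forms, surrogates, values above U+10FFFF.
bool utf8_to_utf32(const std::string& in, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else return false;                        // invalid lead byte
        if (in.size() - i < len) return false;    // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        // Smallest code point that legitimately needs `len` bytes.
        static const char32_t min_value[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_value[len]) return false;          // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate range
        if (cp > 0x10FFFF) return false;                // beyond Unicode
        out.push_back(cp);
        i += len;
    }
    return true;
}
```

The decoder itself is short, but the validation rules (overlong forms, surrogates, range limits) are easy to get wrong, which is one argument for a library or platform function instead.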

Recommended answer

Sounds like you're looking for the std::codecvt type. See the example on that page for usage.
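Since the linked example did not survive on this page, here is one possible way to use the codecvt machinery. It is not necessarily what the linked page showed. It goes through std::wstring_convert with the std::codecvt_utf8 and std::codecvt_utf8_utf16 facets from the C++11 <codecvt> header; the wrapper function names are made up. Note that <codecvt> and std::wstring_convert were deprecated in C++17, so newer compilers may warn about this code.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 -> UTF-32. from_bytes throws std::range_error on invalid input.
std::u32string utf8_to_utf32_codecvt(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8);
}

// UTF-32 -> UTF-8.
std::string utf32_to_utf8_codecvt(const std::u32string& utf32) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(utf32);
}

// UTF-8 -> UTF-16 (surrogate pairs are produced where needed).
std::u16string utf8_to_utf16_codecvt(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}
```

Unlike mbrtoc16 and mbrtoc32, these facets always treat the narrow side as UTF-8, independent of the current locale.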
