Does C++0x support std::wstring conversion to/from UTF-8 byte sequence?


Question

I saw that C++0x will add support for UTF-8, UTF-16 and UTF-32 literals. But what about conversions between the three representations?

I plan to use std::wstring everywhere in my code, but I also need to handle UTF-8 encoded data when dealing with files and the network. Will C++0x also provide support for these operations?

Solution

In C++0x, char16_t and char32_t, not wchar_t, will be used to store UTF-16 and UTF-32.
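As a small illustration of these types, the sketch below builds UTF-16 and UTF-32 strings from the new u"" and U"" literals. The function names are made up for this example and are not part of any library.

```cpp
#include <string>

// Hypothetical helpers, named only for this illustration.
// U+20AC (euro sign) fits in a single UTF-16 code unit.
inline std::u16string euro_utf16() { return u"\u20AC"; }
inline std::u32string euro_utf32() { return U"\u20AC"; }

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16
// stores it as a surrogate pair while UTF-32 stores one code unit.
inline std::u16string emoji_utf16() { return u"\U0001F600"; }
inline std::u32string emoji_utf32() { return U"\U0001F600"; }
```

The point to notice is that size() counts code units, not characters: emoji_utf16().size() is 2, while emoji_utf32().size() is 1.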

From the draft n2798:

22.2.1.4 Class template codecvt

2 The class codecvt is for use when converting from one codeset to another, such as from wide characters to multibyte characters or between wide character encodings such as Unicode and EUC.

3 The specializations required in Table 76 (22.1.1.1.1) convert the implementation-defined native character set. codecvt<char, char, mbstate_t> implements a degenerate conversion; it does not convert at all. The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encodings schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encodings schemes. codecvt<wchar_t, char, mbstate_t> converts between the native character sets for narrow and wide characters. Specializations on mbstate_t perform conversion between encodings known to the library implementor.

Other encodings can be converted by specializing on a user-defined stateT type. The stateT object can contain any state that is useful to communicate to or from the specialized do_in or do_out members.
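To make the quoted wording concrete, here is a minimal sketch of how the codecvt<char32_t, char, mbstate_t> specialization could be driven by hand to go from UTF-32 to UTF-8. The helper name utf32_to_utf8 is hypothetical, and the sketch assumes a standard library that ships this facet in the classic locale. Note that this specialization was later deprecated in C++20, so newer compilers may warn about it.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode a UTF-32 string as UTF-8 using the
// standard codecvt<char32_t, char, mbstate_t> facet.
std::string utf32_to_utf8(const std::u32string& in) {
    if (in.empty()) return std::string();

    typedef std::codecvt<char32_t, char, std::mbstate_t> facet_type;
    const facet_type& cvt =
        std::use_facet<facet_type>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();
    // UTF-8 needs at most 4 bytes per code point.
    std::string out(in.size() * 4, '\0');

    const char32_t* from_next = nullptr;
    char* to_next = nullptr;
    std::codecvt_base::result r = cvt.out(
        state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r != std::codecvt_base::ok)
        throw std::runtime_error("UTF-32 to UTF-8 conversion failed");

    // Trim the buffer to the bytes actually written.
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

For example, utf32_to_utf8(U"\u20AC") should yield the three bytes E2 82 AC. The reverse direction would use the facet's in() member in the same way.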

The thing about wchar_t is that it gives you no guarantees about the encoding used. It is a type that can hold a wide character, period. If you are going to write software now, you have to live with this compromise. C++0x-compliant compilers are still a long way off, though you can always give the VC2010 CTP and g++ a try for what it is worth. Moreover, wchar_t has different sizes on different platforms, which is another thing to watch out for (2 bytes on VS/Windows, 4 bytes on GCC/Mac and so on). Options such as -fshort-wchar for GCC complicate the issue further.
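If you do end up converting std::wstring yourself, the size difference has to be handled explicitly. The sketch below is one possible way, not a definitive implementation: it assumes wchar_t holds UTF-16 when it is 2 bytes wide and UTF-32 otherwise, which the standard does not guarantee. Both function names are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: append one Unicode code point as UTF-8.
void append_utf8(std::string& out, unsigned long cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x110000) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        throw std::invalid_argument("code point out of range");
    }
}

// Hypothetical helper. ASSUMPTION: a 2-byte wchar_t carries UTF-16,
// a wider wchar_t carries UTF-32.
std::string wstring_to_utf8(const std::wstring& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(in[i]);
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF
            && i + 1 < in.size()) {
            unsigned long lo = static_cast<unsigned long>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Combine a UTF-16 surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        // Any surrogate left at this point is unpaired or misplaced.
        if (cp >= 0xD800 && cp <= 0xDFFF)
            throw std::invalid_argument("unpaired surrogate");
        append_utf8(out, cp);
    }
    return out;
}
```

With this, wstring_to_utf8(L"\u20AC") produces the bytes E2 82 AC whether wchar_t is 2 or 4 bytes wide, because the branch on sizeof(wchar_t) only changes how surrogates are treated.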

The best solution, therefore, is to use an existing library. Chasing Unicode bugs around isn't the best use of effort or time. I'd suggest you take a look at:

More on C++0x Unicode string literals here
