Convert between string, u16string & u32string


Problem description

I've been looking for a way to convert between the Unicode string types and came across this method. Not only do I not completely understand the method (there are no comments) but also the article implies that in future there will be better methods.

If this is the best method, could you please point out what makes it work, and if not I would like to hear suggestions for better methods.

Solution

mbstowcs() and wcstombs() don't necessarily convert to UTF-16 or UTF-32; they convert to wchar_t, in whatever encoding the locale uses for wchar_t. All Windows locales use a two-byte wchar_t with UTF-16 as the encoding, but the other major platforms use a four-byte wchar_t with UTF-32 (or even a non-Unicode encoding for some locales). A platform that only supports single-byte encodings could even have a one-byte wchar_t whose encoding differs by locale. So wchar_t seems to me to be a bad choice for portability and Unicode. *
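The platform difference described above can be checked at compile time. This is only a rough sketch: the helper names `wchar_bits` and `wchar_holds_any_code_point` are made up for illustration, and the result says nothing about which encoding a given locale actually uses for wchar_t.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Width of wchar_t in bits on the platform this is compiled for
// (typically 16 on Windows, 32 on most Unix-like systems).
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// True when a single wchar_t is wide enough to hold every Unicode
// code point (up to U+10FFFF), which a UTF-32 wchar_t needs.
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max()) >= 0x10FFFFul;
}
```

A two-byte wchar_t makes `wchar_holds_any_code_point()` false, which is one reason code written against a four-byte wchar_t may not port cleanly.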

Some better options were introduced in C++11: new specializations of std::codecvt, new codecvt classes, and a new template that makes using them for conversions very convenient.

First, the new class template for using codecvt facets is std::wstring_convert. Once you've created an instance of a std::wstring_convert class you can easily convert between strings:

std::wstring_convert<...> convert; // ... filled in with a codecvt to do UTF-8 <-> UTF-16
std::string utf8_string = u8"This string has UTF-8 content";
std::u16string utf16_string = convert.from_bytes(utf8_string);
std::string another_utf8_string = convert.to_bytes(utf16_string);

To do a different conversion you just need different template parameters, one of which is a codecvt facet. Here are some new facets that are easy to use with wstring_convert:

std::codecvt_utf8_utf16<char16_t> // converts between UTF-8 <-> UTF-16
std::codecvt_utf8<char32_t> // converts between UTF-8 <-> UTF-32
std::codecvt_utf8<char16_t> // converts between UTF-8 <-> UCS-2 (warning, not UTF-16! Don't bother using this one)

Examples of using these:

std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::string a = convert.to_bytes(u"This string has UTF-16 content");
std::u16string b = convert.from_bytes(u8"blah blah blah");
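For reference, here is a compilable sketch of the same converter wrapped in two functions. The function names `utf8_to_utf16` and `utf16_to_utf8` are hypothetical, and the UTF-8 input is spelled with byte escapes rather than a u8 literal because u8 literals change type in C++20. Note also that std::wstring_convert and the `<codecvt>` facets were deprecated in C++17, so this assumes a standard library that still ships them.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-8 bytes -> UTF-16 code units.
// from_bytes() throws std::range_error if the input is not valid UTF-8.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.from_bytes(utf8);
}

// UTF-16 code units -> UTF-8 bytes.
// to_bytes() throws std::range_error on malformed UTF-16 (e.g. a lone surrogate).
std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(utf16);
}
```

For example, `utf8_to_utf16("\xC3\xA9")` should yield the single code unit U+00E9, and a four-byte UTF-8 sequence such as "\xF0\x9F\x98\x80" (U+1F600) should come back as two UTF-16 code units.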

The new std::codecvt specializations are a bit harder to use because they have a protected destructor. To get around that you can define a subclass that has a destructor, or you can use the std::use_facet template function to get an existing codecvt instance. Also, an issue with these specializations is you can't use them in Visual Studio 2010 because template specialization doesn't work with typedef'd types and that compiler defines char16_t and char32_t as typedefs. Here's an example of defining your own subclass of codecvt:

template <class internT, class externT, class stateT>
struct codecvt : std::codecvt<internT,externT,stateT>
{ ~codecvt(){} };

std::wstring_convert<codecvt<char16_t,char,std::mbstate_t>,char16_t> convert16;
std::wstring_convert<codecvt<char32_t,char,std::mbstate_t>,char32_t> convert32;

The char16_t specialization converts between UTF-16 and UTF-8. The char32_t specialization converts between UTF-32 and UTF-8.
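As a small illustration of the subclass approach, the sketch below converts UTF-8 to UTF-32 through the char32_t specialization. The names `deletable_codecvt` and `utf8_to_utf32` are hypothetical (the subclass is renamed here only to avoid confusion with std::codecvt), and the code assumes a standard library that still provides std::wstring_convert, which was deprecated in C++17.

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>   // std::codecvt, std::wstring_convert
#include <string>

// std::codecvt has a protected destructor; this subclass only adds a
// public one so that wstring_convert can own and delete the facet.
template <class internT, class externT, class stateT>
struct deletable_codecvt : std::codecvt<internT, externT, stateT> {
    ~deletable_codecvt() {}
};

// UTF-8 bytes -> UTF-32 code points via the char32_t specialization.
std::u32string utf8_to_utf32(const std::string& utf8) {
    std::wstring_convert<deletable_codecvt<char32_t, char, std::mbstate_t>, char32_t> convert32;
    return convert32.from_bytes(utf8);
}
```

Each element of the returned std::u32string is one code point, so a string such as "a\xC3\xA9\xF0\x9F\x98\x80" (seven bytes of UTF-8) should produce three elements.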

Note that these new conversions provided by C++11 don't include any way to convert directly between UTF-32 and UTF-16. Instead you have to combine two instances of std::wstring_convert, using UTF-8 as the intermediate form.
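One way to combine two converters is to go through UTF-8 as the intermediate form. The following is a sketch under that assumption; the function names `utf16_to_utf32` and `utf32_to_utf16` are hypothetical, and it relies on the `<codecvt>` facets, which were deprecated in C++17.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// UTF-16 -> UTF-8 -> UTF-32
std::u32string utf16_to_utf32(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv16;
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv32;
    return conv32.from_bytes(conv16.to_bytes(utf16));
}

// UTF-32 -> UTF-8 -> UTF-16
std::u16string utf32_to_utf16(const std::u32string& utf32) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv16;
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv32;
    return conv16.from_bytes(conv32.to_bytes(utf32));
}
```

The intermediate std::string costs an extra allocation and pass over the data, which is usually acceptable for occasional conversions but may matter in a hot path.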


* I thought I'd add a note on wchar_t and its purpose, to emphasize why it should not generally be used for Unicode or portable internationalized code. The following is a short version of my answer http://stackoverflow.com/a/11107667/365496

What is wchar_t?

wchar_t is defined such that any locale's char encoding can be converted to wchar_t where every wchar_t represents exactly one codepoint:

Type wchar_t is a distinct type whose values can represent distinct codes for all members of the largest extended character set specified among the supported locales (22.3.1). -- [basic.fundamental] 3.9.1/5

This does not require that wchar_t be large enough to represent any character from all locales simultaneously. That is, the encoding used for wchar_t may differ between locales, which means that you cannot necessarily convert a string to wchar_t using one locale and then convert back to char using another locale.

Since that seems to be the primary use in practice for wchar_t you might wonder what it's good for if not that.

The original intent and purpose of wchar_t was to make text processing simple by defining it such that it requires a one-to-one mapping from a string's code units to the text's characters, thus allowing the same simple algorithms used with ASCII strings to work with other languages.

Unfortunately the requirements on wchar_t assume a one-to-one mapping between characters and codepoints to achieve this. Unicode breaks that assumption, so you can't safely use wchar_t for simple text algorithms either.
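A combining sequence shows how Unicode breaks the one-to-one assumption even when every element is a full code point. In this sketch (function names are hypothetical), the same user-perceived character "é" is written once as a single code point and once as two.

```cpp
#include <string>

// "é" as one precomposed code point: U+00E9.
std::u32string precomposed_e_acute() { return U"\u00E9"; }

// The same user-perceived character as two code points:
// 'e' followed by U+0301 COMBINING ACUTE ACCENT.
std::u32string decomposed_e_acute() { return U"e\u0301"; }
```

The first string has length 1 and the second has length 2, and they do not compare equal element by element, so algorithms that treat one array element as one character (length, reversal, comparison) give surprising results without normalization.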

This means that portable software cannot use wchar_t either as a common representation for text between locales, or to enable the use of simple text algorithms.

What use is wchar_t today?

Not much, for portable code anyway. If __STDC_ISO_10646__ is defined, then values of wchar_t directly represent Unicode codepoints with the same values in all locales. That makes it safe to do the inter-locale conversions mentioned earlier. However, you can't rely on it alone to decide whether you can use wchar_t this way because, while most Unix platforms define it, Windows does not, even though Windows uses the same wchar_t encoding in all locales.
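Testing for the macro is a plain preprocessor check. A minimal sketch, with the hypothetical function name `wchar_is_iso_10646`:

```cpp
#include <cwchar>

// True when the implementation promises that wchar_t values are
// ISO 10646 (Unicode) code points in every locale.
constexpr bool wchar_is_iso_10646() {
#ifdef __STDC_ISO_10646__
    return true;
#else
    return false;
#endif
}
```

A false result does not by itself mean wchar_t is locale-dependent on that platform, as the Windows case shows, so this check can only confirm the guarantee, not rule it out.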

I think the reason Windows doesn't define __STDC_ISO_10646__ is that Windows uses UTF-16 as its wchar_t encoding. UTF-16 uses surrogate pairs to represent codepoints greater than U+FFFF, so it doesn't satisfy the requirements for __STDC_ISO_10646__.
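The effect of surrogate pairs is easy to see with std::u16string, which has the same property as a UTF-16 wchar_t string. The helper names below are hypothetical, and the counter assumes well-formed UTF-16 input.

```cpp
#include <cstddef>
#include <string>

bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
bool is_low_surrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Counts code points rather than code units: a high surrogate followed
// by a low surrogate is one code point stored in two code units.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1])) {
            ++i;  // skip the second half of the pair
        }
        ++count;
    }
    return count;
}
```

For the string u"a\U0001F600", `size()` reports 3 code units while `count_code_points` reports 2 code points, so a single array element no longer corresponds to a single codepoint.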

For platform-specific code wchar_t may be more useful. It's essentially required on Windows (e.g., some files simply cannot be opened without using wchar_t filenames), though Windows is the only platform where this is true as far as I know (so maybe we can think of wchar_t as 'Windows_char_t').

In hindsight wchar_t is clearly not useful for simplifying text handling, or as storage for locale-independent text. Portable code should not attempt to use it for these purposes.
