Why do we convert from MultiByte to WideChar?

Question

I am used to dealing with ASCII strings, but now that I am working with Unicode I am confused about some terms:

What is a multi-byte character, and what is a wide char? What is the difference? Does "multi-byte" refer to a character that occupies more than one byte in memory, while a wide char is just a data type to represent it?

  • Why do we convert with MultiByteToWideChar and WideCharToMultiByte?

If I declare something like this:

const wchar_t* wcMsg = L"مرحبا";
MessageBoxW(0, wcMsg, 0, 0);

It prints the message correctly if I define UNICODE. But why didn't I have to convert with WideCharToMultiByte here?

  • What is the difference between the character sets in my project: _MBCS and UNICODE?

One last thing that confuses me: MSDN says that the "Windows APIs" are UTF-16.

Can anyone explain this with some examples? A good clarification would be really appreciated.

Recommended answer

An ASCII string has a character width of one byte (usually 8 bits, rarely 7, 9 or other bit widths). This is a legacy of a time when memory was very small and expensive, and processors could often handle only one byte per instruction.

As you can easily imagine, one byte is by far not enough to store all the glyphs in the world. Chinese alone has about 87,000 glyphs. An 8-bit char can distinguish only 256 values. ASCII defines only 95 printable characters plus 33 non-printable control characters, 128 codes in total, which makes it a 7-bit charset. This is enough for English upper- and lower-case letters, digits, some punctuation and a few other symbols. ASCII does not use the highest bit of the common 8-bit byte.

To handle more glyphs than one byte can hold, one approach is to store the fundamental glyphs in one byte, other common glyphs in two bytes, and rarely used glyphs in 3 or even more bytes. This approach is called a multi-byte character set or variable-width encoding. A very common example is UTF-8, which uses from one to 4 bytes per character. It stores the ASCII charset in one byte (so it is also backward compatible with ASCII). The highest bit acts as a switch: if it is clear, the byte is a complete ASCII character; if it is set, the byte belongs to a multi-byte sequence. The leading bits of the first byte tell how many bytes the sequence has, and every following byte starts with the bits 10, so that a "chain" of up to 4 bytes is formed. The pros of a variable-width charset are:

  • Backward compatible with the 7-bit ASCII charset
  • Memory friendly: uses as little memory as possible

The cons are:

  • More difficult and more processor-expensive to handle. You cannot simply iterate over a string and assume that each myString[n] delivers one glyph; instead, you must inspect each byte to see whether more bytes follow.
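
To make the iteration problem concrete, here is a minimal sketch that counts the characters (code points) in a UTF-8 string. The helper name utf8_length is made up for this example, and the code assumes the input is already well-formed UTF-8; it performs no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string.
// Continuation bytes always have the bit pattern 10xxxxxx, so every byte
// that does NOT match that pattern starts a new character.
// Assumption: the input is valid UTF-8; nothing is validated here.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {  // not a continuation byte
            ++count;
        }
    }
    return count;
}
```

Note that s.size() still reports the number of bytes, which is why s[n] cannot be used to reach the n-th character directly.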

Another approach is to store each character in a fixed-length word of n bytes that is wide enough to hold all possible glyphs. This is called a fixed-width charset; all chars have the same width. A well-known example is UTF-32. It is 32 bits wide and can store every possible character in one word. The pros and cons of a fixed-width charset are obviously the opposite of those of a variable-width charset: memory-heavy but easier to iterate.
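
As a small illustration of the fixed-width case, the sketch below uses the C++11 type std::u32string. The function name code_point_at is made up for this example. Indexing gives one code point per element; a glyph built from several combining code points would still span several elements.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: in UTF-32 every code point occupies exactly one
// char32_t, so the n-th code point is simply the n-th element.
char32_t code_point_at(const std::u32string& s, std::size_t n) {
    return s.at(n);  // throws std::out_of_range if n is past the end
}
```

The price is memory: every character takes 4 bytes, including plain ASCII letters.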

But Microsoft chose their native charset even before UTF-32 was available: they use UTF-16 as the charset of Windows, which works with 16-bit units, so every character takes at least 2 bytes. One 16-bit unit can store far more glyphs than a single-byte charset, but not all of them; the remaining characters need two units (4 bytes). Considering this, Microsoft's distinction between "Multi-Byte" and "Unicode" is a bit misleading today, because their Unicode implementation is also a multi-byte charset, just one with a bigger minimum size per glyph. Some say that is a good compromise, some say it is the worst of both worlds; anyway, that is the way it is. At that time (Windows NT) it was the only available Unicode charset, and from this perspective their distinction between multi-byte and Unicode was correct back then (see Raymond Chen's comment).
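
The "not all of them" part can be sketched with the portable C++11 type char16_t, which avoids any assumption about the width of wchar_t. The helper names below are made up for this example. A character outside the 16-bit range is stored as two units, a so-called surrogate pair.

```cpp
#include <cassert>
#include <string>

// True if the 16-bit unit is the first half (high surrogate) of a pair.
bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if the 16-bit unit is the second half (low surrogate) of a pair.
bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a surrogate pair into the code point it encodes.
// Assumption: the caller has already checked both halves.
char32_t combine_surrogates(char16_t high, char16_t low) {
    return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10)
                   + (static_cast<char32_t>(low) - 0xDC00);
}
```

For example, the Arabic letter U+0645 fits into one unit, while U+1F600 is stored as the two units 0xD83D and 0xDE00.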

Of course, if you want to transfer a string from one encoding (say UTF-8) into another (say UTF-16), you have to convert it. That is what MultiByteToWideChar does for you, and WideCharToMultiByte does the reverse. There are some other conversion functions and libraries as well.
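
To show what such a conversion involves, here is a portable sketch that turns UTF-8 into UTF-16 by hand. It is not how MultiByteToWideChar is implemented, and the name utf8_to_utf16 is made up for this example; on Windows you would normally call MultiByteToWideChar with the CP_UTF8 code page instead. The sketch assumes well-formed input and does no error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only: converts UTF-8 to UTF-16.
// Assumption: the input is well-formed UTF-8; nothing is validated.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        // The leading bits of the first byte give the sequence length.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));  // one 16-bit unit
        } else {
            cp -= 0x10000;                             // surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Every byte has to be inspected and a new buffer has to be filled, which is where the cost of converting at each system call comes from.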

This conversion takes quite some time, so the conclusion is: if you make heavy use of strings and system calls, for the sake of performance you should use the native charset of your operating system, which would be UTF-16 in your case.

So for your string handling you should choose wchar_t, which on Windows means UTF-16. Unfortunately, the width of wchar_t may vary from compiler to compiler; under Unix it is usually 32 bits wide (UTF-32), under Windows it is 16 bits wide (UTF-16).
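
A quick way to see the difference on your own compiler is to look at the size of wchar_t. The sketch below is only an illustration; the function name wchar_bits is made up. It also shows the C++11 types char16_t and char32_t, which have a guaranteed minimum width on every platform.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Hypothetical helper: number of bits in one wchar_t on the current
// compiler and platform (typically 16 on Windows, 32 on Linux).
constexpr std::size_t wchar_bits() { return sizeof(wchar_t) * CHAR_BIT; }

// The C++11 character types do not vary like wchar_t does:
// char16_t can always hold a UTF-16 unit, char32_t a UTF-32 unit.
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t holds a UTF-16 unit");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "char32_t holds a UTF-32 unit");
```

Code that has to run on both Windows and Unix should therefore not assume that one wchar_t is one UTF-16 unit.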

_MBCS is an automatic preprocessor define that tells you the project's character set is multi-byte; UNICODE tells you it is set to UTF-16.

You can write

const wchar_t* wcMsg = L"مرحبا";
MessageBoxW(0, wcMsg, 0, 0);

even in a program that does not have the UNICODE define set. The L prefix marks your string as a UNICODE (wide char) string, so you can pass it to the W variants of the system functions without any conversion.

Unfortunately, you cannot write

const char* msg = u8"مرحبا";
MessageBoxA(0, msg, 0, 0);

Character set support was improved in C++11, so you can also define a string as UTF-8 with the u8 prefix. But the Windows functions with the "A" suffix do not understand UTF-8, at least until Windows 10 Build 17035 (see tambre's comment; see also https://stackoverflow.com/a/504789/2328447). This is another reason to use UTF-16, aka UNICODE, under Windows/Visual Studio.

Setting your project to "Use Multi-Byte Character Set" or "Use Unicode Character Set" also changes a lot of other character-dependent defines. The most common ones are the macros TCHAR and _T(), and all string-dependent Windows functions without a suffix, e.g. MessageBox() (without the W or A suffix). If you set your project to "Use Multi-Byte Character Set", TCHAR expands to char, _T() expands to nothing, and the Windows functions get the A suffix attached. If you set your project to "Use Unicode Character Set", TCHAR expands to wchar_t, _T() expands to the L prefix, and the Windows functions get the W suffix attached.
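
The mechanism behind these macros can be sketched in a few lines. The names MY_TCHAR and MY_T are made up for this example and are simplified stand-ins, not the actual definitions; the real macros live in the Windows headers and in tchar.h, which test UNICODE and _UNICODE respectively.

```cpp
#include <cassert>

// Simplified, hypothetical stand-ins for TCHAR and _T().
// Assumption: a single switch named UNICODE selects the variant.
#ifdef UNICODE
typedef wchar_t MY_TCHAR;
#define MY_T(x) L##x   // pastes the L prefix onto the literal
#else
typedef char MY_TCHAR;
#define MY_T(x) x      // leaves the literal unchanged
#endif

// The same source line compiles in both configurations.
const MY_TCHAR* greeting = MY_T("Hello");
```

The suffix-less function names work the same way: a macro maps the name to either the A or the W variant, depending on the project setting.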

This means that writing

const TCHAR* msg = _T("Hello");
MessageBox(0, msg, 0, 0);

will compile with either the multi-byte or the Unicode character set. You can find some comprehensive guides on these topics on MSDN.

Unfortunately,

TCHAR* msg = _T("مرحبا");
MessageBox(0, msg, 0, 0);

still won't work when "Use Multi-Byte Character Set" is selected: the Windows functions still don't support UTF-8, and you will even get some compiler warnings, because the string contains Unicode characters but is not marked as a Unicode string (_T() does not expand to u8).
