Difference between MBCS and UTF-8 on Windows

Question

I am reading about character sets and encodings on Windows. I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them? What I am not getting is how UTF-8 is conceptually different from an MBCS encoding. Also, I found the following quote in MSDN:


Unicode is a 16-bit character encoding

This negates whatever I read about Unicode. I thought Unicode can be encoded with different encodings such as UTF-8 and UTF-16. Can somebody shed some more light on this confusion?

Recommended answer


I noticed that there are two compiler flags in the Visual Studio compiler (for C++) called MBCS and UNICODE. What is the difference between them?

Many functions in the Windows API come in two versions: One that takes char parameters (in a locale-specific code page) and one that takes wchar_t parameters (in UTF-16).

int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);

Each of these function pairs also has a macro without the suffix, which resolves to one version or the other depending on whether the UNICODE macro is defined.

#ifdef UNICODE
   #define MessageBox MessageBoxW
#else
   #define MessageBox MessageBoxA
#endif

In order to make this work, the TCHAR type is defined to abstract away the character type used by the API functions.

#ifdef UNICODE
    typedef wchar_t TCHAR;
#else
    typedef char TCHAR;
#endif

This, however, was a bad idea. You should always explicitly specify the character type.
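
To see why the generic names are fragile, here is a minimal sketch of the same macro pattern. `DemoLengthA`, `DemoLengthW`, `DemoChar`, and `DEMO_TEXT` are hypothetical stand-ins, not Windows API names. The point is only that the generic call changes meaning with a build flag, while the explicit call does not.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for an "A"/"W" function pair (not real Windows functions).
std::size_t DemoLengthA(const char* text)    { return std::strlen(text); }
std::size_t DemoLengthW(const wchar_t* text) { return std::wcslen(text); }

// Same pattern as the MessageBox and TCHAR definitions above.
#ifdef UNICODE
    #define DemoLength DemoLengthW
    typedef wchar_t DemoChar;
    #define DEMO_TEXT(s) L##s
#else
    #define DemoLength DemoLengthA
    typedef char DemoChar;
    #define DEMO_TEXT(s) s
#endif

// Generic style: the character type and the function actually called
// both depend on whether UNICODE was defined when this file was compiled.
std::size_t generic_call() {
    const DemoChar* text = DEMO_TEXT("hello");
    return DemoLength(text);
}

// Explicit style: always wchar_t and always the "W" function,
// regardless of build flags.
std::size_t explicit_call() {
    const wchar_t* text = L"hello";
    return DemoLengthW(text);
}

// Small self-check: both styles see five characters for this ASCII string.
void demo_dispatch() {
    assert(generic_call() == 5);
    assert(explicit_call() == 5);
}
```

With the explicit style, a reader can tell from the call site alone which encoding the string is expected to be in.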


What I am not getting is how UTF-8 is conceptually different from an MBCS encoding?

MBCS stands for "multi-byte character set". For the literal-minded, it seems that UTF-8 would qualify.

But in Windows, "MBCS" only refers to character encodings that can be used with the "A" versions of the Windows API functions. This includes code pages 932 (Shift_JIS), 936 (GBK), 949 (KS_C_5601-1987), and 950 (Big5), but NOT UTF-8.

To use UTF-8, you have to convert the string to UTF-16 using MultiByteToWideChar, call the "W" version of the function, and call WideCharToMultiByte on the output. This is essentially what the "A" functions actually do, which makes me wonder why Windows doesn't just support UTF-8.
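
On Windows, the first step would normally be a call to MultiByteToWideChar with the CP_UTF8 code page. To show what that conversion involves, here is a portable sketch that does it by hand. `utf8_to_utf16` is a hypothetical helper, not a Windows function, and it uses `char16_t` rather than `wchar_t` so that it behaves the same on any platform. Throwing on malformed input is just one possible policy.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts UTF-8 bytes to UTF-16 code units.
// Rejects truncated or malformed sequences, overlong forms,
// encoded surrogates, and values above U+10FFFF.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;        // code point being assembled
        std::size_t extra = 0;  // number of continuation bytes expected
        char32_t min = 0;       // smallest code point allowed for this length
        if (b0 < 0x80)                { cp = b0;        extra = 0; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; min = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");
        i += extra + 1;

        if (cp < 0x10000) {
            // Basic Multilingual Plane: one UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Small self-check using U+20AC (three UTF-8 bytes) and U+1F600 (four UTF-8 bytes).
void demo_conversion() {
    const std::u16string euro = utf8_to_utf16("\xE2\x82\xAC");
    assert(euro.size() == 1 && euro[0] == 0x20AC);
    const std::u16string face = utf8_to_utf16("\xF0\x9F\x98\x80");
    assert(face.size() == 2 && face[0] == 0xD83D && face[1] == 0xDE00);
}
```

In real Windows code the system conversion functions are the better choice; this sketch only makes the extra step visible.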

This inability to support the most common character encoding makes the "A" version of the Windows API useless. Therefore, you should always use the "W" functions.


Unicode is a 16-bit character encoding

This negates whatever I read about Unicode.

MSDN is wrong. Unicode is a 21-bit coded character set that has several encodings, the most common being UTF-8, UTF-16, and UTF-32. (There are other Unicode encodings as well, such as GB18030, UTF-7, and UTF-EBCDIC.)
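
As a rough illustration of one code point having several encodings, the sketch below encodes a single code point as UTF-8 and as UTF-16. The names `encode_utf8` and `encode_utf16` are hypothetical and chosen for this example only. The range check against 0x10FFFF is where the "21-bit" figure shows up.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Throws unless cp is a Unicode scalar value (U+0000..U+10FFFF, no surrogates).
// 0x10FFFF needs 21 bits, which is why Unicode is called a 21-bit code.
void check_scalar(char32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a Unicode scalar value");
}

// Encodes one code point as 1 to 4 UTF-8 bytes.
std::string encode_utf8(char32_t cp) {
    check_scalar(cp);
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encodes one code point as 1 or 2 UTF-16 code units.
std::u16string encode_utf16(char32_t cp) {
    check_scalar(cp);
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF))); // low surrogate
    }
    return out;
}

// Small self-check: the same code point has a different length in each encoding.
void demo_encodings() {
    // U+20AC EURO SIGN: three UTF-8 bytes, one UTF-16 code unit.
    assert(encode_utf8(0x20AC) == "\xE2\x82\xAC");
    assert(encode_utf16(0x20AC).size() == 1);
    // U+1F600: four UTF-8 bytes, two UTF-16 code units (a surrogate pair).
    assert(encode_utf8(0x1F600).size() == 4);
    assert(encode_utf16(0x1F600).size() == 2);
}
```

A code point such as U+1F600 does not fit in a single 16-bit unit, so no fixed 16-bit encoding can cover all of Unicode.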

Whenever Microsoft refers to "Unicode", they really mean UTF-16 (or UCS-2). This is for historical reasons. Windows NT was an early adopter of Unicode, back when 16 bits was thought to be enough for everyone, and UTF-8 was only used on Plan 9. So UCS-2 was Unicode.
