Windows API: ANSI and Wide-Character Strings -- Is it UTF8 or ASCII? UTF-16 or UCS-2 LE?

Question

I'm not quite pro with encodings, but here's what I think I know (though it may be wrong):

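  1. ASCII is a 7-bit fixed-length encoding, with the characters you can find in ASCII charts.
  2. UTF8 is an 8-bit variable-length encoding. All characters can be written in UTF8.
  3. UCS-2 LE/BE are fixed-length 16-bit encodings that support most common characters.
  4. UTF-16 is a 16-bit variable-length encoding. All characters can be written in UTF16.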

Are those above all correct?

Now, for the questions:

  1. Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?
  2. Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.
  3. In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?
  4. Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)
  5. Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?
  6. Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?
  7. What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?
  8. (I had more questions, but this is enough... I forgot some of them anyway...)

These are a lot of questions, so any links to explanations about how all these connect (aside from reading the Unicode standard, which won't help with the Windows API anyway) would also be greatly appreciated.

Thanks!

Recommended answer

Are those above all correct?

Yes, if you don't assume the existence of characters not encoded in Unicode (for most practical applications, this assumption is fine).
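
As a rough illustration of the four claims in the question, the sketch below measures how many bytes a few characters need in each encoding. It uses Python's built-in codecs as a stand-in, not the Windows API; the helper name `encoded_lengths` is hypothetical. Python ships no UCS-2 codec, so UCS-2 is only implied here: a character that needs 4 bytes in UTF-16 is one that UCS-2 cannot represent.

```python
def encoded_lengths(ch):
    """Return the byte count needed to encode ch in each encoding,
    or None where the encoding cannot represent the character."""
    lengths = {}
    for name in ("ascii", "utf-8", "utf-16-le"):
        try:
            lengths[name] = len(ch.encode(name))
        except UnicodeEncodeError:
            lengths[name] = None
    return lengths


# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji outside the BMP)
for ch in ("\u0041", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

Only the last character is outside the BMP, and it is the only one that takes two 16-bit code units in UTF-16.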

Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?

They take byte strings (i.e., strings whose code unit is a byte, which is always an octet on Windows) encoded in the current "ANSI"/MBCS/legacy encoding. "ANSI" is the historical term for these encodings, but it is not correct. For Western Windows systems, this encoding is usually Windows-1252.
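
To make the dependence on the current code page concrete, here is a small sketch using Python's `cp1252` and `cp1251` codecs as stand-ins for two legacy Windows code pages. It does not call any "A" function; it only shows that the same byte means different things depending on which legacy encoding is assumed.

```python
# One byte, as it might be passed to an "A" function.
raw = bytes([0xE9])

as_western = raw.decode("cp1252")   # Windows-1252: U+00E9, Latin small e with acute
as_cyrillic = raw.decode("cp1251")  # Windows-1251: U+0439, Cyrillic small short i
print(as_western, as_cyrillic)

# UTF-8 bytes misread as Windows-1252 turn one character into two wrong ones.
mojibake = "\u00e9".encode("utf-8").decode("cp1252")
print(mojibake)
```

This is why passing UTF-8 bytes to an "A" function on a system whose legacy encoding is Windows-1252 tends to produce garbled text.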

Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.

Since Windows 2000, most of them support UTF-16. The name "wide" and the rest of the Microsoft terminology (e.g., "Unicode" meaning "UTF-16" or "UCS") were chosen before the modern Unicode standard unified the terminology.
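
The practical difference between UCS-2 and UTF-16 is the surrogate pair: a code point above U+FFFF is split into two 16-bit code units. The sketch below does the arithmetic by hand and cross-checks it against Python's UTF-16 encoder; the function name `to_surrogate_pair` is made up for this example.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Cross-check against Python's own encoder (little-endian, as used on Windows).
encoded = "\U0001F600".encode("utf-16-le")
expected = high.to_bytes(2, "little") + low.to_bytes(2, "little")
print(encoded == expected)
```

Code that treats each 16-bit unit as a whole character (the UCS-2 view) sees two meaningless values here instead of one character.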

In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?

Every other encoding that WideCharToMultiByte supports is a "multi-byte encoding" in this context, including Windows-1251 and UTF-8.

Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)

LPWSTR is a pointer to wchar_t, which is always a 16-bit unsigned integer on Windows. Which characters can be displayed is unrelated to the encoding as long as that encoding can encode all Unicode characters. Windows is generally able to display non-BMP characters, but not everywhere (e.g., the console cannot).
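
One consequence of 16-bit code units is that the length of a wide string, counted in wchar_t elements, can exceed the number of characters. The sketch below approximates the contents of such a buffer by unpacking UTF-16-LE bytes into unsigned 16-bit integers; `utf16_code_units` is a hypothetical helper, and no Windows type is actually involved.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text, roughly what a wchar_t buffer on Windows would hold."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


text = "a\U0001F600"               # one BMP character plus one non-BMP character
units = utf16_code_units(text)
print(len(text), len(units))       # code points versus 16-bit code units
print([hex(u) for u in units])
```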

Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?

Don't really know, but I don't think they differ too much. I suppose you could just try converting a non-BMP character to UTF-8 and check whether the result is correct.
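
For anyone running that experiment, it helps to know what a correct and an incorrect result look like. The sketch below is Python only and does not call WideCharToMultiByte or wcstombs. It shows the correct 4-byte UTF-8 form of U+1F600 next to the 6-byte form that results when each surrogate is converted on its own, which is what a converter with a UCS-2 view of the input would be expected to produce.

```python
ch = "\U0001F600"

# Correct UTF-8: the surrogate pair is combined into one code point first.
correct_utf8 = ch.encode("utf-8")

# Incorrect: each 16-bit code unit is converted separately.
units = ch.encode("utf-16-le")
surrogates = [
    chr(int.from_bytes(units[i:i + 2], "little"))
    for i in range(0, len(units), 2)
]
per_unit = b"".join(s.encode("utf-8", "surrogatepass") for s in surrogates)

print(correct_utf8.hex(" "))
print(per_unit.hex(" "))
```

If the function under test returns the 4-byte sequence, it handles UTF-16 properly; a 6-byte result or a replacement character suggests it does not.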

Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?

File paths are indeed opaque arrays of UTF-16 code units, meaning that Windows doesn't perform any kind of translation when storing or reading file names (like Linux and unlike Mac OS X). But Windows still has its weird, mostly undefined case-insensitive behavior, which causes much trouble because file names that are treated as equivalent aren't necessarily equal. That breaks many invariants; for example, on Linux without interference from other threads, if you successfully create two files A and a in some directory, you'll end up with two distinct files, while on Windows you get only one file (and in general, an unpredictable number of files).
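
The broken invariant can be modeled without touching a real file system. In this sketch, `distinct_files` is a hypothetical helper that counts how many files exist after creating each name once. Python's `str.upper()` is used only as a rough stand-in for the case mapping Windows applies, which is an assumption: the real mapping is not the same table.

```python
def distinct_files(names, case_sensitive):
    """Count the distinct files that result from creating each name once."""
    if case_sensitive:
        keys = set(names)
    else:
        # Assumption: str.upper() approximates the file system's case folding.
        keys = {name.upper() for name in names}
    return len(keys)


names = ["A", "a"]
print(distinct_files(names, case_sensitive=True))   # Linux-like behavior
print(distinct_files(names, case_sensitive=False))  # Windows-like behavior
```

Because the result depends on the exact case-mapping table, code that predicts collisions with its own table can disagree with the file system, which is where the unpredictable file count comes from.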

What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?

ANSI is the American standardization organization. Using this word when referring to encodings is a misnomer, but a frequent one, so you should be aware of it. I prefer the term legacy 8-bit encoding, because I think that's essentially what it is: a non-Unicode encoding that is kept only for compatibility with legacy (Windows 9x) applications. On Western systems, this is usually Windows-1252, which is a proper superset of ASCII.
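
The "proper superset" claim can be checked with Python's `cp1252` codec, used here as a stand-in for Windows-1252: every ASCII byte decodes to the same character, and bytes above 0x7F carry characters that ASCII does not have.

```python
ascii_bytes = bytes(range(128))

# All 128 ASCII byte values mean the same thing in Windows-1252.
same_on_ascii_range = ascii_bytes.decode("cp1252") == ascii_bytes.decode("ascii")
print(same_on_ascii_range)

# Byte 0x80 is the euro sign in Windows-1252 but is not valid ASCII at all.
euro = bytes([0x80]).decode("cp1252")
try:
    bytes([0x80]).decode("ascii")
    ascii_accepts_0x80 = True
except UnicodeDecodeError:
    ascii_accepts_0x80 = False
print(euro, ascii_accepts_0x80)
```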
