Are UTF16 (as used by for example wide-winapi functions) characters always 2 bytes long?


Problem Description




Please clarify for me, how does UTF16 work? I am a little confused, considering these points:

  • There is a static type in C++, WCHAR, which is 2 bytes long. (always 2 bytes long, obviously)
  • Most of MSDN and some other documentation seem to assume that characters are always 2 bytes long. This may just be my imagination; I can't come up with any particular examples, but it just seems that way.
  • There are no "extra wide" functions or character types widely used in C++ or Windows, so I would assume that UTF16 is all that is ever needed.
  • To my uncertain knowledge, Unicode has a lot more characters than 65535, so they obviously don't have enough space in 2 bytes.
  • UTF16 seems to be a bigger version of UTF8, and UTF8 characters can be of different lengths.

So if a UTF16 character is not always 2 bytes long, how long else could it be? 3 bytes? Or only multiples of 2? And then, for example, if there is a winapi function that wants to know the size of a wide string in characters, and the string contains 2 characters which are each 4 bytes long, how is the size of that string in characters calculated?

Is it 2 chars long or 4 chars long? (since it is 8 bytes long, and each WCHAR is 2 bytes)

UPDATE: Now I see that character-counting is not necessarily a standard-thing or a c++ thing even, so I'll try to be a little more specific in my second question, about the length in "characters" of a wide string:

On Windows, specifically, in Winapi, in their wide functions (ending with W), how does one count the number of characters in a string that consists of 2 unicode codepoints, each consisting of 2 codeunits (total of 8 bytes)? Is such a string 2 characters long (the same as the number of codepoints) or 4 characters long (the same as the total number of codeunits)?

Or, being more generic: What does the windows definition of "number of characters in a wide string" mean, number of codepoints or number of codeunits?

Solution

Short answer: No.

The size of a wchar_t—the basic character unit—is not defined by the C++ Standard (see section 3.9.1 paragraph 5). In practice, on Windows platforms it is two bytes long, and on Linux/Mac platforms it is four bytes long.

In addition, the characters are stored in an endian-specific format. On Windows this usually means little-endian, but it’s also valid for a wchar_t to contain big-endian data.

Furthermore, even though each wchar_t is two (or four) bytes long, an individual glyph (roughly, a character) could require multiple wchar_ts, and there may be more than one way to represent it.

A common example is the character é (LATIN SMALL LETTER E WITH ACUTE), code point 0x00E9. This can also be represented as "decomposed" code point sequence 0x0065 0x0301 (which is LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT). Both are valid; see the Wikipedia article on Unicode equivalence for more information.

Simply, you need to know or pick the encoding that you will be using. If dealing with Windows APIs, an easy choice is to assume everything is little-endian UTF-16 stored in 2-byte wchar_ts.

On Linux/Mac, UTF-8 (with chars) is more common and APIs usually take UTF-8. wchar_t is often seen as wasteful there because it uses 4 bytes per character.

For cross-platform programming, therefore, you may wish to work with UTF-8 internally and convert to UTF-16 on-the-fly when calling Windows APIs. Windows provides the MultiByteToWideChar and WideCharToMultiByte functions to do this, and you can also find wrappers that simplify using these functions, such as the ATL and MFC String Conversion Macros.

Update

The question has been updated to ask what Windows APIs mean when they ask for the "number of characters" in a string.

If the API says "size of the string in characters" they are referring to the number of wchar_ts (or the number of chars if you are compiling in non-Unicode mode for some reason). In that specific case you can ignore the fact that a Unicode character may take more than one wchar_t. Those APIs are just looking to fill a buffer and need to know how much room they have.

