What is the largest code point for 16-bit wchar_t type?

Question

It is said here that UTF-16's largest code point is 10FFFF.

It is also written on that page that

BMP characters require one 16-bit code unit to process or store.

But in bit representation, 10FFFF is

0001 0000   1111 1111   1111 1111

We see that it occupies more than 15 bits of a 16-bit wchar_t (an implementation is allowed to support wide characters with >=0 values only, independently of the signedness of wchar_t).

What is the real largest code point for a 16-bit wchar_t?

Recommended answer

It is said here that UTF-16's largest code point is 10FFFF

Yes, but you are misinterpreting the table that you are drawing that from.

U+10FFFF is the largest Unicode code point value. UTF-16 is not Unicode itself; it is an encoding of Unicode code points using 16-bit code units (just as UTF-8 is an encoding using 8-bit code units). As you remarked, 16 bits is not enough to represent the full range of Unicode code point values. The UTF-16 encoding of Unicode code points U+0000 - U+FFFF requires only 1 code unit, but the encoding of code points U+10000 - U+10FFFF requires 2 code units acting together, known as a "surrogate pair". UTF-16 is the successor to UCS-2, which was the original 16-bit encoding for Unicode but could only encode code points U+0000 - U+FFFF. UTF-16 is backwards compatible with UCS-2, but adding surrogate pairs allows UTF-16 to support the full range of Unicode code points.
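As an illustration of the one-unit versus two-unit split described above, here is a minimal sketch of a UTF-16 encoder in C. The function name `utf16_encode` is made up for this example and is not part of any standard library; the arithmetic follows the usual UTF-16 definition (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units to out[] and returns how many were written,
 * or 0 if cp cannot be encoded (a surrogate value, or above U+10FFFF). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* not a Unicode scalar value */

    if (cp <= 0xFFFF) {                 /* BMP: a single code unit suffices */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                      /* leaves a 20-bit value */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

With this sketch, U+FFFF comes out as the single unit 0xFFFF, while U+10FFFF comes out as the pair 0xDBFF 0xDFFF. Neither unit of the pair exceeds 16 bits, even though the code point itself needs 21.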

UTF-16 is designed so that the code unit values from which surrogate pairs can be formed (0xD800 - 0xDFFF) are reserved for that purpose. They cannot be misinterpreted as regular characters, even when they appear unpaired (in what therefore must be an invalid code sequence).
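Because the surrogate ranges are reserved, a decoder can classify every code unit on sight. The following is a rough sketch under that assumption; `utf16_decode_next` and the `UTF16_INVALID` sentinel are names invented for this example, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel returned for an unpaired surrogate. */
#define UTF16_INVALID 0xFFFFFFFFu

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Decode the code point that starts at s[*i] (n is the length in code
 * units) and advance *i past the units that were consumed. */
uint32_t utf16_decode_next(const uint16_t *s, size_t n, size_t *i)
{
    assert(s != NULL && i != NULL && *i < n);

    uint16_t u = s[(*i)++];

    if (is_high_surrogate(u)) {
        if (*i < n && is_low_surrogate(s[*i])) {
            uint16_t lo = s[(*i)++];
            return 0x10000u + (((uint32_t)(u - 0xD800) << 10)
                               | (uint32_t)(lo - 0xDC00));
        }
        return UTF16_INVALID;   /* high surrogate with no low one after it */
    }
    if (is_low_surrogate(u))
        return UTF16_INVALID;   /* low surrogate with no high one before it */

    return u;                   /* ordinary BMP code unit */
}
```

A lone 0xD800 or 0xDC00 is reported as invalid here rather than being returned as a character, which is one reasonable policy among several (substituting U+FFFD is another).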

Note also that it's a bit of an abuse, albeit a common one, for a C implementation to call UTF-16 (or UTF-8) a "character set", as their code units do not all correspond 1-1 with Unicode characters. Or, at least, the characters to which they correspond have to be interpreted as the code units that they are. It's a pragmatic approach to the problem of efficiently representing characters from a large range.

Also it is written on that page that

BMP characters require one 16-bit code unit to process or store.

That is also true. You apparently have overlooked the fact that BMP (Basic Multilingual Plane, code points U+0000 - U+FFFF) characters are a subset of all Unicode characters. 1/17th of them, in fact, or somewhat less, depending on how you count. The fact that their code point values can all be represented with 16 bits (i.e. in one UTF-16 code unit) could in fact be taken as a definition of that subset.
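The "1/17th" figure can be checked with a little arithmetic: the code space U+0000 - U+10FFFF divides into planes of 0x10000 code points each. A small sketch follows; the helper names are made up for illustration, and excluding the 2048 surrogate values is only one of several possible ways of counting.

```c
#include <assert.h>
#include <stdint.h>

#define UNICODE_MAX      0x10FFFFUL   /* largest Unicode code point */
#define PLANE_SIZE       0x10000UL    /* code points per plane */
#define SURROGATE_COUNT  0x800UL      /* 0xD800..0xDFFF, all inside the BMP */

/* Number of planes in the Unicode code space. */
unsigned long unicode_plane_count(void)
{
    return (UNICODE_MAX + 1) / PLANE_SIZE;
}

/* Plane number of a code point; plane 0 is the BMP. */
unsigned long unicode_plane(uint32_t cp)
{
    assert(cp <= UNICODE_MAX);
    return cp >> 16;
}

/* BMP code points that are not surrogates. */
unsigned long bmp_scalar_count(void)
{
    return PLANE_SIZE - SURROGATE_COUNT;
}
```

There are 17 planes, so the BMP is exactly 1/17 of the code space. Counting only non-surrogate values gives 63488 out of 1112064, which is slightly less than 1/17.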

We see that it occupies more than 15 bits of 16-bit wchar_t (an implementation is allowed to support wide characters with >=0 value only, independently of signedness of wchar_t)

No, as we covered in my answer to one of your other recent questions. The standard imposes no restriction on C implementations to support only non-negative code point values. That's just the de facto state of the code point assignments of all current, widely-used coded character sets. A conforming C implementation on which wchar_t is signed could provide a character set in which some extended characters have negative corresponding wchar_t values.

What is the real largest code point for 16-bit wchar_t?

That has nothing to do with any of the foregoing. In fact, it doesn't make much sense. Code point values are a characteristic of (coded) character sets, not of any C data type. They are the numbers corresponding to the characters supported by that set.

If a C implementation claims to provide UTF-16 as a supported character set, then it follows that its wchar_t must have at least 16 value bits, because that type must be able to represent all UTF-16 code unit values. If that type has only 16 bits altogether, then they must all be value bits, making the type necessarily unsigned and capable of supporting values up to 0xFFFF.
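The value-bit argument can be made concrete with a short sketch. `max_value` models a hypothetical integer type with no padding bits (an assumption made for this example), and `wchar_t_holds_utf16_units` uses the standard `WCHAR_MAX` macro from `<wchar.h>` to ask the same question of the real wchar_t on whatever implementation compiles it. Both function names are invented here.

```c
#include <assert.h>
#include <wchar.h>

/* Largest value of a hypothetical integer type that is total_bits wide
 * and has no padding bits. A signed type spends one bit on the sign. */
unsigned long long max_value(unsigned total_bits, int is_signed)
{
    assert(total_bits >= 2 && total_bits <= 32);

    unsigned value_bits = is_signed ? total_bits - 1 : total_bits;
    return (1ULL << value_bits) - 1;
}

/* Nonzero if this implementation's wchar_t can represent every
 * UTF-16 code unit value (0x0000..0xFFFF). */
int wchar_t_holds_utf16_units(void)
{
    return WCHAR_MAX >= 0xFFFF;
}
```

A 16-bit signed type tops out at 0x7FFF, which is too small for code units such as 0xD800, whereas a 16-bit unsigned type reaches 0xFFFF. Implementations with a 32-bit wchar_t pass the check whether or not the type is signed.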
