Unicode, UTF16, .NET – string, CodePoint out of BMP (Basic Multilingual Plane)


Question

Dear experts,
I'm trying to understand .NET strings and Unicode in all their details. From what I have read, .NET strings are UTF-16 encoded.

Based on this "knowledge"/"assumption" I tried to see how code points outside the BMP are handled by .NET strings, and failed on my very first experiment.

My code:

int music = 0x1D161; // U+1D161 = MUSICAL SYMBOL SIXTEENTH NOTE
string s1 = Char.ConvertFromUtf32(music); // yields the surrogate pair U+D834 U+DD61
textBox1.Text = s1;





With the above code I expected
a.) to see the musical symbol in the text box, but I see only a square
b.) s1.Length to return 1 (even though the code point needs two code units, a surrogate pair?), but Length returns 2

Can anybody explain where I'm wrong?

Thank you very much in advance.
Bruno

Recommended answer

You should rarely need to rely on any particular UTF encoding: most of the .NET API is well abstracted from the in-memory representation (which is UTF-16). Remember that in every UTF except UTF-32, characters are represented using a varying number of bytes. In UTF-16, characters beyond the BMP are encoded as surrogate pairs. You can serialize them using the Encoding class. As to Unicode code points, they should be understood as pure mathematical integers in their natural order, fully abstracted from their computer representation.

As to your particular problem, your approach is correct, because in UTF-32LE every value in the range 0..10FFFF is encoded exactly as the code point itself. A square is what gets drawn when the font in use has no glyph for the character, and I have never seen a Windows font that supports this range for musical notation (http://unicode.org/charts/PDF/U1D100.pdf). Maybe this is the only problem.
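
To make the encoding remarks concrete, here is a minimal sketch that serializes the string from the question with three of the built-in Encoding instances. It is written as C# top-level statements (an assumption: C# 9 or later), and the variable names are only illustrative.

```csharp
using System;
using System.Text;

// Same code point as in the question: U+1D161 MUSICAL SYMBOL SIXTEENTH NOTE.
string s = char.ConvertFromUtf32(0x1D161);

// GetBytes returns the encoded bytes without a byte order mark.
string utf8Hex  = BitConverter.ToString(Encoding.UTF8.GetBytes(s));
string utf16Hex = BitConverter.ToString(Encoding.Unicode.GetBytes(s)); // UTF-16, little endian
string utf32Hex = BitConverter.ToString(Encoding.UTF32.GetBytes(s));   // UTF-32, little endian

Console.WriteLine("UTF-8 : " + utf8Hex);   // F0-9D-85-A1
Console.WriteLine("UTF-16: " + utf16Hex);  // 34-D8-61-DD (surrogates D834 and DD61)
Console.WriteLine("UTF-32: " + utf32Hex);  // 61-D1-01-00 (the code point itself)
```

Only the UTF-32 bytes spell out the code point directly; UTF-16 splits it into a high and a low surrogate, and UTF-8 spreads it over four bytes.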

As to the second question: a Length of 2 is correct. The property returns the number of Char values, that is, 16-bit UTF-16 code units, not the number of characters. You converted the code point to a surrogate pair, right? That is one character, but two Char values.

—SA
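
For the Length question, a small sketch that inspects the string from the question directly may help. It uses only standard .NET calls (String.Length, Char.IsSurrogatePair, Char.ConvertToUtf32, StringInfo.LengthInTextElements); the variable names and the use of top-level statements (C# 9 or later) are assumptions made for illustration.

```csharp
using System;
using System.Globalization;

string s1 = char.ConvertFromUtf32(0x1D161);

// Length counts Char values (UTF-16 code units), so a surrogate pair counts twice.
int codeUnits = s1.Length;

// The two Char values at index 0 and 1 form one surrogate pair.
bool isPair = char.IsSurrogatePair(s1, 0);

// Reading the pair back gives the original code point.
int codePoint = char.ConvertToUtf32(s1, 0);

// StringInfo counts text elements, treating the pair as a single element.
int textElements = new StringInfo(s1).LengthInTextElements;

Console.WriteLine("Length               = " + codeUnits);              // 2
Console.WriteLine("IsSurrogatePair      = " + isPair);                 // True
Console.WriteLine("Code point           = U+" + codePoint.ToString("X")); // U+1D161
Console.WriteLine("LengthInTextElements = " + textElements);           // 1
```

If the number of user-visible characters is what you need, StringInfo is the more suitable tool; Length is only the size of the Char sequence.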

