Using Unicode characters bigger than 2 bytes with .NET

Question

I'm using this code to generate U+10FFFC:

var s = Encoding.UTF8.GetString(new byte[] {0xF4,0x8F,0xBF,0xBC});

I know it's a private-use code point, but it displays as a single character, as I'd expect. The problems come when manipulating this Unicode character.

If I later do this:

foreach(var ch in s)
{
    Console.WriteLine(ch);
}

Instead of printing just the single character, it prints two characters (i.e. the string is apparently composed of two characters). If I alter my loop to add these characters back to an empty string like so:

string tmp="";
foreach(var ch in s)
{
    Console.WriteLine(ch);
    tmp += ch;
}

At the end of this, tmp will print just a single character.

What exactly is going on here? I thought that char contains a single Unicode character and that I never had to worry about how many bytes a character is unless I'm converting to bytes. My real use case is that I need to detect when very large Unicode characters are used in a string. Currently I have something like this:

foreach(var ch in s)
{
    if(ch>=0x100000 && ch<=0x10FFFF)
    {
        Console.WriteLine("special character!");
    }
}

However, because of this splitting of very large characters, this doesn't work. How can I modify this to make it work?

Recommended answer

U+10FFFC is one Unicode code point, but string's interface does not expose a sequence of Unicode code points directly. Its interface exposes a sequence of UTF-16 code units. That is a very low-level view of text. It is quite unfortunate that such a low-level view of text was grafted onto the most obvious and intuitive interface available... I'll try not to rant much about how I don't like this design, and just say that no matter how unfortunate, it is just a (sad) fact you have to live with.

First off, I suggest using char.ConvertFromUtf32 (http://msdn.microsoft.com/en-us/library/system.char.convertfromutf32.aspx) to get your initial string. Much simpler, much more readable:

var s = char.ConvertFromUtf32(0x10FFFC);

So, this string's Length is not 1, because, as I said, the interface deals in UTF-16 code units, not Unicode code points. U+10FFFC uses two UTF-16 code units, so s.Length is 2. All code points above U+FFFF require two UTF-16 code units for their representation.
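To make the one-versus-two code unit boundary concrete, here is a minimal sketch. It assumes a compiler that supports top-level statements (C# 9 or later), and CodeUnitCount is a hypothetical helper name, not something from the answer.

```csharp
using System;

// Hypothetical helper (not from the answer): number of UTF-16 code units,
// i.e. chars, that .NET needs to store one code point.
static int CodeUnitCount(int codePoint) => char.ConvertFromUtf32(codePoint).Length;

Console.WriteLine(CodeUnitCount(0x41));      // 1: 'A' is in the Basic Multilingual Plane
Console.WriteLine(CodeUnitCount(0xFFFD));    // 1: still at or below U+FFFF
Console.WriteLine(CodeUnitCount(0x10000));   // 2: first code point above U+FFFF
Console.WriteLine(CodeUnitCount(0x10FFFC));  // 2: the code point from the question
```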

You should note that ConvertFromUtf32 doesn't return a char: char is a UTF-16 code unit, not a Unicode code point. To be able to return all Unicode code points, that method cannot return a single char. Sometimes it needs to return two, and that's why it returns a string. Sometimes you will find APIs dealing in ints instead of chars, because an int can hold any code point (that's what ConvertFromUtf32 takes as its argument, and what ConvertToUtf32 produces as its result).
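The int-in, string-out shape of these two methods can be sketched as a round trip. RoundTrip is a hypothetical helper name, and the snippet again assumes top-level statements (C# 9 or later).

```csharp
using System;

// Hypothetical helper: code point -> string -> code point.
static int RoundTrip(int codePoint)
{
    string text = char.ConvertFromUtf32(codePoint); // one or two chars
    return char.ConvertToUtf32(text, 0);            // back to a single int
}

Console.WriteLine(RoundTrip(0x10FFFC) == 0x10FFFC); // True
Console.WriteLine(RoundTrip(0x41) == 0x41);         // True

// Surrogate values cannot stand alone as code points, so the conversion rejects them.
try
{
    char.ConvertFromUtf32(0xD800);
}
catch (ArgumentOutOfRangeException)
{
    Console.WriteLine("0xD800 is rejected");
}
```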

string implements IEnumerable<char>, which means that when you iterate over a string you get one UTF-16 code unit per iteration. That's why iterating your string and printing it out yields some broken output with two "things" in it. Those are the two UTF-16 code units that make up the representation of U+10FFFC. They are called "surrogates". The first one is a high/lead surrogate and the second one is a low/trail surrogate. When you print them individually they do not produce meaningful output because lone surrogates are not even valid in UTF-16, and they are not considered Unicode characters either.
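For illustration, the two surrogates can also be computed by hand using the standard UTF-16 arithmetic. This is only a sketch: ToSurrogates is a hypothetical helper, and in real code char.ConvertFromUtf32 already does this for you.

```csharp
using System;

// Hypothetical helper: splits a code point in U+10000..U+10FFFF into its
// high (lead) and low (trail) surrogate.
static (char High, char Low) ToSurrogates(int codePoint)
{
    if (codePoint < 0x10000 || codePoint > 0x10FFFF)
        throw new ArgumentOutOfRangeException(nameof(codePoint));

    int offset = codePoint - 0x10000;               // a 20-bit value
    char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
    char low = (char)(0xDC00 + (offset & 0x3FF));   // bottom 10 bits
    return (high, low);
}

var (lead, trail) = ToSurrogates(0x10FFFC);
Console.WriteLine($"{(int)lead:X4} {(int)trail:X4}"); // DBFF DFFC

// The framework produces the same two code units.
string viaFramework = char.ConvertFromUtf32(0x10FFFC);
Console.WriteLine(viaFramework[0] == lead && viaFramework[1] == trail); // True
```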

When you append those two surrogates to the string in the loop, you effectively reconstruct the surrogate pair, and printing that pair later as a unit gives you the right output.

On the ranting front, note how nothing complains that you used a malformed UTF-16 sequence in that loop. It creates a string with a lone surrogate, and yet everything carries on as if nothing happened: the string type is not even the type of well-formed UTF-16 code unit sequences, but the type of any UTF-16 code unit sequence.
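Since string will hold a lone surrogate without complaint, any well-formedness check has to be done explicitly. A minimal sketch, with IsWellFormedUtf16 as a hypothetical helper name and top-level statements assumed:

```csharp
using System;

// Hypothetical helper: true if every surrogate in s is part of a valid high/low pair.
static bool IsWellFormedUtf16(string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        if (char.IsSurrogatePair(s, i))
            i++;                 // valid pair: skip the low surrogate
        else if (char.IsSurrogate(s, i))
            return false;        // lone high or low surrogate
    }
    return true;
}

string pair = char.ConvertFromUtf32(0x10FFFC);
string lone = pair.Substring(0, 1); // just the high surrogate; no exception is thrown

Console.WriteLine(IsWellFormedUtf16(pair)); // True
Console.WriteLine(IsWellFormedUtf16(lone)); // False
```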

The char structure provides static methods to deal with surrogates: IsHighSurrogate, IsLowSurrogate, IsSurrogatePair, ConvertToUtf32, and ConvertFromUtf32. If you want you can write an iterator that iterates over Unicode characters instead of UTF-16 code units:

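using System.Collections.Generic;

// Extension methods must live in a static class; the name StringExtensions is arbitrary.
static class StringExtensions
{
    // Yields one code point per iteration; ConvertToUtf32 throws ArgumentException on a lone surrogate.
    public static IEnumerable<int> AsCodePoints(this string s)
    {
        for (int i = 0; i < s.Length; ++i)
        {
            yield return char.ConvertToUtf32(s, i);
            // A high surrogate here means the code point used two chars; skip the low one.
            if (char.IsHighSurrogate(s, i))
                i++;
        }
    }
}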

Then you can iterate like:

foreach(int codePoint in s.AsCodePoints())
{
     // do stuff. codePoint will be an int with value 0x10FFFC in your example
}

If you prefer to get each code point as a string instead, change the return type to IEnumerable<string> and the yield line to:

yield return char.ConvertFromUtf32(char.ConvertToUtf32(s, i));

With that version, the following works as-is:

foreach(string codePoint in s.AsCodePoints())
{
     Console.WriteLine(codePoint);
}
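Coming back to the use case in the question (detecting code points in the range 0x100000 to 0x10FFFF), the same surrogate-aware approach could look something like the sketch below. ContainsPlane16 is a hypothetical helper name; lone surrogates are simply skipped here, which is an assumption about what you want in that case.

```csharp
using System;

// Hypothetical helper: true if s contains any code point in U+100000..U+10FFFF.
static bool ContainsPlane16(string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        // Only a surrogate pair can encode a code point above U+FFFF.
        if (char.IsSurrogatePair(s, i))
        {
            int codePoint = char.ConvertToUtf32(s, i);
            if (codePoint >= 0x100000 && codePoint <= 0x10FFFF)
                return true;
            i++; // skip the low surrogate of this pair
        }
    }
    return false;
}

Console.WriteLine(ContainsPlane16("x" + char.ConvertFromUtf32(0x10FFFC))); // True
Console.WriteLine(ContainsPlane16("plain ASCII"));                         // False
```

The comparison is done on the int returned by char.ConvertToUtf32, never on an individual char, because a char can only hold values up to 0xFFFF. On .NET Core 3.0 and later, string.EnumerateRunes() and System.Text.Rune offer a built-in way to walk code points, if that runtime is available to you.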
