Using Unicode characters bigger than 2 bytes with .NET

Question

I'm using this code to generate U+10FFFC:

var s = Encoding.UTF8.GetString(new byte[] {0xF4,0x8F,0xBF,0xBC});

I know it's a private-use code point, but it does display as a single character, as I'd expect. The problems come when manipulating this Unicode character.

If I later do this:

foreach(var ch in s)
{
    Console.WriteLine(ch);
}

Instead of printing just the single character, it prints two characters (i.e. the string is apparently composed of two characters). If I alter my loop to add these characters back to an empty string, like so:

string tmp="";
foreach(var ch in s)
{
    Console.WriteLine(ch);
    tmp += ch;
}

At the end of this, tmp prints as just a single character.

What exactly is going on here? I thought that a char contains a single Unicode character and that I never had to worry about how many bytes a character takes unless I'm converting to bytes. My real use case is that I need to be able to detect when very large Unicode characters are used in a string. Currently I have something like this:

foreach(var ch in s)
{
    if(ch>=0x100000 && ch<=0x10FFFF)
    {
        Console.WriteLine("special character!");
    }
}

However, because very large characters are split like this, it doesn't work. How can I modify this to make it work?

Recommended answer

U+10FFFC is one Unicode code point, but string's interface does not expose a sequence of Unicode code points directly. Its interface exposes a sequence of UTF-16 code units. That is a very low-level view of text. It is quite unfortunate that such a low-level view of text was grafted onto the most obvious and intuitive interface available... I'll try not to rant much about how I don't like this design, and just say that no matter how unfortunate, it is just a (sad) fact you have to live with.

First off, I suggest using char.ConvertFromUtf32 to get your initial string. It is much simpler and much more readable:

var s = char.ConvertFromUtf32(0x10FFFC);

So, this string's Length is not 1, because, as I said, the interface deals in UTF-16 code units, not Unicode code points. U+10FFFC uses two UTF-16 code units, so s.Length is 2. All code points above U+FFFF require two UTF-16 code units for their representation.
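As a quick check of those numbers, here is a minimal sketch. It assumes C# 9 top-level statements (on older compilers, wrap the body in a Main method). StringInfo from System.Globalization is not part of the answer; it is used here only as an assumption-free way to show a count that is closer to what a reader perceives.

```csharp
using System;
using System.Globalization;

// Build the string from the code point, as suggested above.
string s = char.ConvertFromUtf32(0x10FFFC);

// Length counts UTF-16 code units, not code points.
Console.WriteLine(s.Length);                                // 2

// StringInfo counts text elements; the surrogate pair is a single one.
Console.WriteLine(new StringInfo(s).LengthInTextElements);  // 1

// A code point at or below U+FFFF still fits in one code unit.
Console.WriteLine(char.ConvertFromUtf32(0xFFFD).Length);    // 1
```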

You should note that ConvertFromUtf32 doesn't return a char: char is a UTF-16 code unit, not a Unicode code point. To be able to return all Unicode code points, that method cannot return a single char. Sometimes it needs to return two, and that's why it returns a string. Sometimes you will find APIs dealing in ints instead of chars, because an int can hold any code point (that's what ConvertFromUtf32 takes as its argument, and what ConvertToUtf32 produces as its result).
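The round trip between int code points and strings might look like the sketch below. It uses only the char methods named in this answer and again assumes C# 9 top-level statements; the variable names are illustrative.

```csharp
using System;

// ConvertFromUtf32 takes an int code point and returns a string,
// because one char cannot hold a code point above U+FFFF.
string bmp = char.ConvertFromUtf32(0x00E9);       // fits in one char
string astral = char.ConvertFromUtf32(0x10FFFC);  // needs a surrogate pair

Console.WriteLine(bmp.Length);     // 1
Console.WriteLine(astral.Length);  // 2

// ConvertToUtf32 goes the other way and yields the code point as an int.
int fromString = char.ConvertToUtf32(astral, 0);
int fromPair = char.ConvertToUtf32(astral[0], astral[1]);

Console.WriteLine(fromString.ToString("X"));  // 10FFFC
Console.WriteLine(fromPair == fromString);    // True
```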

string implements IEnumerable<char>, which means that when you iterate over a string you get one UTF-16 code unit per iteration. That's why iterating your string and printing it out yields some broken output with two "things" in it. Those are the two UTF-16 code units that make up the representation of U+10FFFC. They are called "surrogates". The first one is a high/lead surrogate and the second one is a low/trail surrogate. When you print them individually they do not produce meaningful output, because lone surrogates are not even valid in UTF-16, and they are not considered Unicode characters either.
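To see the two code units without relying on how the console renders them, a sketch like the following prints their numeric values and classifies them. It assumes C# 9 top-level statements; the kind variable is just for illustration.

```csharp
using System;

string s = char.ConvertFromUtf32(0x10FFFC);

// Each iteration yields one UTF-16 code unit, so the loop body runs twice.
foreach (char ch in s)
{
    string kind = char.IsHighSurrogate(ch) ? "high surrogate"
                : char.IsLowSurrogate(ch) ? "low surrogate"
                : "not a surrogate";
    Console.WriteLine($"U+{(int)ch:X4} {kind}");
}
// Prints:
// U+DBFF high surrogate
// U+DFFC low surrogate
```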

When you append those two surrogates to the string in the loop, you effectively reconstruct the surrogate pair, and printing that pair later as one unit gets you the right output.

On the ranting front, note how nothing complains that you used a malformed UTF-16 sequence in that loop. After the first iteration, tmp holds a string with a lone surrogate, and yet everything carries on as if nothing happened: the string type is not even the type of well-formed UTF-16 code unit sequences, but the type of any UTF-16 code unit sequence.
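A small sketch of that behaviour, assuming C# 9 top-level statements: the string type accepts a lone surrogate without complaint, and the problem only surfaces once something asks for an actual code point.

```csharp
using System;

string s = char.ConvertFromUtf32(0x10FFFC);

// Taking only the first code unit yields a string holding a lone high surrogate.
// Nothing objects: string accepts any sequence of UTF-16 code units.
string lone = s.Substring(0, 1);
Console.WriteLine(lone.Length);                    // 1
Console.WriteLine(char.IsHighSurrogate(lone[0]));  // True

// ConvertToUtf32 needs a real code point, so it rejects the lone surrogate.
bool threw = false;
try
{
    char.ConvertToUtf32(lone, 0);
}
catch (ArgumentException)
{
    threw = true;
}
Console.WriteLine(threw);                          // True

// Appending the second code unit restores a well-formed pair.
string rebuilt = lone + s[1];
Console.WriteLine(rebuilt == s);                   // True
```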

The char structure provides static methods to deal with surrogates: IsHighSurrogate, IsLowSurrogate, IsSurrogatePair, ConvertToUtf32, and ConvertFromUtf32. If you want, you can write an iterator that iterates over Unicode code points instead of UTF-16 code units:

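using System.Collections.Generic;
// Extension methods must live in a static class; this class name is arbitrary.
public static class StringExtensions
{
    // Yields Unicode code points instead of UTF-16 code units.
    // Assumes well-formed UTF-16: ConvertToUtf32 throws on a lone surrogate.
    public static IEnumerable<int> AsCodePoints(this string s)
    {
        for(int i = 0; i < s.Length; ++i)
        {
            yield return char.ConvertToUtf32(s, i);
            // A surrogate pair occupies two code units, so skip the second one.
            if(char.IsHighSurrogate(s, i))
                i++;
        }
    }
}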

Then you can iterate like this:

foreach(int codePoint in s.AsCodePoints())
{
     // do stuff. codePoint will be an int with value 0x10FFFC in your example
}

If you prefer to get each code point as a string instead, change the return type to IEnumerable<string> and the yield line to:

yield return char.ConvertFromUtf32(char.ConvertToUtf32(s, i));

With that version, the following works as-is:

foreach(string codePoint in s.AsCodePoints())
{
     Console.WriteLine(codePoint);
}
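Applied to the check from the question, the same idea could be sketched as follows. ContainsCodePointInRange is a hypothetical helper name, not something from the framework, and the code assumes C# 9 top-level statements. Unlike a plain ConvertToUtf32 loop, this variant tolerates lone surrogates by treating them as ordinary code units, which is a design choice rather than the only reasonable behaviour.

```csharp
using System;

// Hypothetical helper: true if any code point in s lies in [min, max].
static bool ContainsCodePointInRange(string s, int min, int max)
{
    for (int i = 0; i < s.Length; ++i)
    {
        int codePoint;
        if (char.IsSurrogatePair(s, i))
        {
            codePoint = char.ConvertToUtf32(s, i);
            i++; // skip the low surrogate that was just consumed
        }
        else
        {
            // A BMP character or a lone surrogate; neither is above U+FFFF.
            codePoint = s[i];
        }

        if (codePoint >= min && codePoint <= max)
            return true;
    }
    return false;
}

string special = "a" + char.ConvertFromUtf32(0x10FFFC);

Console.WriteLine(ContainsCodePointInRange("abc", 0x100000, 0x10FFFF));    // False
Console.WriteLine(ContainsCodePointInRange(special, 0x100000, 0x10FFFF));  // True
```

The comparison against 0x100000 and 0x10FFFF now works because it is made on an int code point rather than on a single char, which can never exceed 0xFFFF.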
