Unicode with Surrogate Pairs in .Net


Question

As I understand it, .Net represents 32-bit characters using a pair (or "surrogate pair") of 16-bit characters. However, I haven't been able to find any functions that deal with these pairs as a single character. For example, Windows Forms is capable of displaying this surrogate pair as a single character:

// This displays the character as I expect.
MessageBox.Show(char.ConvertFromUtf32(int.Parse("2A601", NumberStyles.HexNumber)));


However, when I get the length of that string, it is 2 (I would expect it to be 1):

// Shows 2 rather than 1.
MessageBox.Show(char.ConvertFromUtf32(int.Parse("2A601", NumberStyles.HexNumber)).Length.ToString());


Also, when I get the first character, some block character is shown rather than the character I expect:

// Shows a box character rather than 𪘁.
MessageBox.Show(char.ConvertFromUtf32(int.Parse("2A601", NumberStyles.HexNumber)).Substring(0, 1));


FYI, you may need something installed to see the special characters above, but you should get the point even if you can't see them.
Basically, I would like to know if there are any string functions to handle surrogate pairs properly (e.g., index them correctly, count them as a single character rather than two). Or, if I'm looking at the concept of surrogate pairs wrong, feel free to correct me.

Solution

"As I understand it, .Net represents 32-bit characters using a pair (or "surrogate pair") of 16-bit characters"

.Net uses UTF-16, and you may find this interesting:
http://www.unicode.org/notes/tn12/

and this: http://www.yoda.arachsys.com/csharp/unicode.html

Libraries like ICU (http://site.icu-project.org/) take surrogate pairs into account, using an iterator approach, while .Net seems to treat UTF-16 as UCS-2. I suspect that the underlying OS features implement and use UTF-16 more in line with the standard.

As SAKryukov mentions, UnicodeEncoding actually takes these things into account, but it seems that the usual practice is to consider only the length of the string, and that usually tends to work out nicely anyway, unless you are doing character-by-character processing.

To get more than a box, you need to use a font that supports the characters you want to display.

Regards
Espen Harlinn


There are a number of issues here. There is no need to add support for surrogate pairs yourself: they are supported automatically by the OS (Windows 2000 needs a tweak to support them; later versions of Windows come with surrogate support built in).

The notion of a surrogate pair is only relevant to the two UTF-16 encodings (UTF-16LE and UTF-16BE); UTF-32 and UTF-8 support characters beyond the BMP (Basic Multilingual Plane) directly or through the UTF-8 algorithm, respectively. In application memory, UTF-16LE is used, and the character type does not really represent a Unicode code point: some code points are represented as two characters, as you correctly point out, so some care is needed to index characters; see below.
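To make the two-character representation concrete, here is a minimal sketch of the UTF-16 arithmetic that turns a code point above the BMP into its surrogate pair. The class and method names (SurrogateMath, ToSurrogates) are made up for illustration; the framework already does this for you in char.ConvertFromUtf32.

```csharp
using System;

public static class SurrogateMath
{
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogate code units.
    // Hypothetical helper; char.ConvertFromUtf32 is the built-in equivalent.
    public static char[] ToSurrogates(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException("codePoint");

        int offset = codePoint - 0x10000;              // 20 significant bits
        char high = (char)(0xD800 + (offset >> 10));   // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }
}
```

For the code point from the question, SurrogateMath.ToSurrogates(0x2A601) gives the pair 0xD869, 0xDE01, which is the same two chars that char.ConvertFromUtf32(0x2A601) returns.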

One can use characters above the BMP in the UI directly, without any re-coding. The text should be placed in XML resources. As XML files can declare the UTF-8 charset, anyone can type such text directly using any editor capable of saving data in UTF-8 format:

<?xml version="1.0" encoding="utf-8"?>



The XML file will be embedded as a resource in the .NET assembly; at run time, the text will be loaded and converted into the UTF-16 memory representation, with code points above the BMP represented as surrogate pairs. In principle, such text can even be entered in C# code as hard-coded string literals, but I would strongly recommend avoiding that. Any hard-coded string literals, even ASCII-only ones, are best avoided in code, with rare exceptions.

The biggest concern is the deployment of fonts that implement code point ranges above the BMP. From what I know, no such fonts are bundled with Windows. However, I tested Unicode above the BMP using some open-source fonts and had no problems with them.

The mixed-size nature of a character string is reflected in the members of the abstract class System.Text.Encoding. For example, look at the following methods of this class: GetByteCount, GetBytes, GetCharCount, GetChars. They reflect the fact that there is no one-to-one correspondence between bytes and chars: the first two accept a string or char[] parameter as input, and the last two accept a byte[].
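As a rough illustration of that missing one-to-one correspondence, the sketch below measures one string with the three standard encodings. EncodingSizes and its methods are names invented for this example; the Encoding members it calls are the ones listed above.

```csharp
using System.Text;

public static class EncodingSizes
{
    // Returns { UTF-8 bytes, UTF-16 bytes, UTF-32 bytes, UTF-16 chars }.
    // Hypothetical helper, only here to compare the numbers side by side.
    public static int[] Measure(string text)
    {
        return new int[]
        {
            Encoding.UTF8.GetByteCount(text),
            Encoding.Unicode.GetByteCount(text),  // Encoding.Unicode is UTF-16LE
            Encoding.UTF32.GetByteCount(text),
            text.Length
        };
    }

    // Encodes to UTF-8 and asks the decoder how many chars
    // those bytes will produce when decoded again.
    public static int CharCountFromUtf8(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetCharCount(bytes);
    }
}
```

For the string produced by char.ConvertFromUtf32(0x2A601), Measure returns 4, 4, 4 and 2: four bytes in every encoding, but two chars. For "A" it returns 1, 2, 4 and 1.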

There is no direct access to code point indexing, though. I would guess this is because the information is rarely needed and would require a lot of redundant data (see below). Controls process surrogate pairs automatically. If necessary, anyone can build such an index in code. To do that, one needs to create a separate index map, represented, for example, as an array of integers.

Traverse the string's "characters" (in the .NET sense, not code points) in a loop and examine each one with the static predicates System.Char.IsLowSurrogate(char), IsHighSurrogate or IsSurrogate, incrementing the code point index accordingly: by 1 for each "real" character (one that represents a code point on its own) and by 1 for each two "surrogate" characters that together form a surrogate pair.
Once you have the index map, you can index a string by code points and use other functions with code point semantics.
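The traversal itself can be sketched in a few lines. This version only counts code points rather than building a map, and the choice to count an unpaired surrogate as one code point is an assumption of this sketch, not something the framework mandates. CodePointCounter is a hypothetical name.

```csharp
using System;

public static class CodePointCounter
{
    // Counts code points: a high surrogate immediately followed by a low
    // surrogate counts once; every other char, including an unpaired
    // surrogate, counts on its own (an assumption of this sketch).
    public static int Count(string text)
    {
        if (text == null) throw new ArgumentNullException("text");

        int count = 0;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsHighSurrogate(text[i]) &&
                i + 1 < text.Length &&
                char.IsLowSurrogate(text[i + 1]))
            {
                i++; // skip the low half of the pair
            }
            count++;
        }
        return count;
    }
}
```

With this, "a" followed by U+2A601 followed by "b" has a Length of 4 but a Count of 3.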

The implementation would look like this (not tested):

public class CodePointIndexer {

    public CodePointIndexer(string value) {
        this.value = value;
        indexMap = BuildCodePointMap(value);
    } //CodePointIndexer

    public string Value { get { return this.value; } }

    //returns the chars of the code point at the given code point index
    public char[] this[int index] { //may throw out-of-range exception
        get {
            int charIndex = this.indexMap[index];
            char start = value[charIndex];
            if (System.Char.IsSurrogatePair(value, charIndex))
                return new char[] { start, value[charIndex + 1] }; //pair: two chars
            else
                return new char[] { start }; //BMP char or unpaired surrogate
        } //get this as code point
    } //this

    string value;
    int[] indexMap; //code point index -> index of its first char in value

    #region implementation

    static int[] BuildCodePointMap(string source) {
        if (source == null) return null;
        if (source.Length < 1) return new int[] { };
        System.Collections.Generic.List<int> list =
            new System.Collections.Generic.List<int>();
        int currentIndex = 0;
        while (currentIndex < source.Length) {
            list.Add(currentIndex); //a new code point starts here
            //a well-formed surrogate pair takes two chars, anything else one
            currentIndex +=
                System.Char.IsSurrogatePair(source, currentIndex) ? 2 : 1;
        } //loop
        return list.ToArray();
    } //BuildCodePointMap

    #endregion implementation

} //class CodePointIndexer



Sorry if I did not list a comprehensive set of relevant .NET APIs; working above the BMP is quite an exotic requirement. At the same time, the methods I already mentioned are enough to implement any Unicode computing task.

—SA


If your character is beyond the BMP (and 0x2A601, decimal 173569, is > 0xFFFF) then your string contains a high surrogate as well as a low surrogate, which together encode your code point. This means that TWO elements, i.e. TWO 16-bit words, encode ONE character. Length always returns the number of array elements, not the number of characters/code points! This is because a code point greater than 0xFFFF occupies a dword (two 16-bit words) in a UTF-16 stream. Because a high and a low surrogate are TWO words, a length of 2 is correct. Length means "number of elements" of an array, not "CharCount" or "CodepointCount" as you expect.
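A small sketch of how to take or decode the first code point without cutting a pair in half, which is what Substring(0, 1) does in the question. FirstCodePoint, Take and Decode are hypothetical names; char.IsSurrogatePair and char.ConvertToUtf32 are the real framework calls. Both methods assume a non-empty string.

```csharp
using System;

public static class FirstCodePoint
{
    // Returns the first code point as a string: two chars for a
    // surrogate pair, one char otherwise. Assumes text is non-empty.
    public static string Take(string text)
    {
        int width = char.IsSurrogatePair(text, 0) ? 2 : 1;
        return text.Substring(0, width);
    }

    // Decodes the code point that starts at index 0. ConvertToUtf32
    // throws ArgumentException if that position holds an unpaired surrogate.
    public static int Decode(string text)
    {
        return char.ConvertToUtf32(text, 0);
    }
}
```

For the string from the question, Length is 2, Take returns the whole two-char string, and Decode returns 0x2A601 again.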

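There is a class called StringInfo (in System.Globalization) that should do the job you are looking for. It checks for surrogate pairs (and hopefully skips orphaned surrogates) and returns the number of text elements, not array elements; a surrogate pair counts as one text element, and so does a base character followed by combining marks. Try it.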
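For reference, a minimal sketch of the StringInfo calls that cover the two cases from the question (counting and taking the first "character"). The wrapper class TextElements is invented for this example; LengthInTextElements and SubstringByTextElements are members of StringInfo. First assumes a non-empty string.

```csharp
using System.Globalization;

public static class TextElements
{
    // Number of text elements; a surrogate pair counts as one.
    public static int Count(string text)
    {
        return new StringInfo(text).LengthInTextElements;
    }

    // First text element as a string. Assumes text is non-empty.
    public static string First(string text)
    {
        return new StringInfo(text).SubstringByTextElements(0, 1);
    }
}
```

For the string from char.ConvertFromUtf32(0x2A601), Count returns 1 and First returns both halves of the pair, so nothing is split.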

If the control you want to display the code point with is surrogate-aware, it will decode the code point encoded in the high/low surrogate pair and query the configured font for the glyph. Be sure you have configured a font that has the proper glyph for your code point (e.g. Arial Unicode MS has many glyphs but not all).

kind regards,
yb

