How to retrieve the Unicode representation of a char?


Question

Hello,

I want to get the Unicode representation of a char using C#. For example, for "A" the result should be "U+0041".

TIA

Recommended answer

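// Both helpers need "using System;" for Convert.

// Formats one UTF-16 code unit as "U+XXXX" (uppercase hex, at least four digits),
// matching the "U+0041" form asked for in the question.
private string CharToUnicodeFormat(char c)
{
    return string.Format("U+{0:X4}", (int)c);
}

// Parses "U+XXXX" back into a char; this only works for values up to U+FFFF.
private char UnicodeFormatToChar(string ucf)
{
    return Convert.ToChar(Convert.ToInt32(ucf.Substring(2), 16));
}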

Keep in mind that C# also gives you tools for working with its internal Unicode representation, such as the escape '\u0000'. Backslash-u is the source-code syntax for an "escaped" Unicode character, whereas 'U+0000' is the formal Unicode notation for a code point.
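As a small illustration of the escape syntax, the sketch below compares an escaped literal with a plain one and shows the 8-digit \U form for a character above U+FFFF. The class and method names are made up for this example.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class EscapeDemo
{
    // '\u0041' is a char literal written with a 4-digit Unicode escape;
    // it is the same value as the literal 'A'.
    public static bool EscapeEqualsLiteral()
    {
        return '\u0041' == 'A';
    }

    // "\U00010056" is an 8-digit escape inside a string literal.
    // The character lies above U+FFFF, so the string holds two
    // UTF-16 code units (a surrogate pair).
    public static int SupplementaryLength()
    {
        string s = "\U00010056";
        return s.Length;
    }
}
```

Calling EscapeDemo.EscapeEqualsLiteral() returns true, and EscapeDemo.SupplementaryLength() returns 2.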


The Unicode value is simply the value the char variable holds in memory. All you need to do is cast it to an integer and display it.
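A minimal sketch of that cast, assuming the char is a BMP character rather than half of a surrogate pair. CodeUnitDemo and its members are illustrative names, not part of any library.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class CodeUnitDemo
{
    // The numeric value of a single UTF-16 code unit.
    public static int ValueOf(char c)
    {
        return (int)c;
    }

    // The same value as four uppercase hex digits, e.g. "0041".
    public static string Hex(char c)
    {
        return ((int)c).ToString("X4");
    }
}
```

For example, CodeUnitDemo.ValueOf('A') is 65 and CodeUnitDemo.Hex('\u03bb') is "03BB".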


The answers provided before won't work beyond the BMP (Basic Multilingual Plane).

Unfortunately, support for such characters in .NET is somewhat limited; more exactly, the support is complete, but indirect. In brief, the type System.String is a self-consistent type with full Unicode support, but System.Char is not: among all possible values of this type, not all represent a character. Some correspond to an undefined code point, some to a character, and some to the high or low member of a surrogate pair: https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF

Here is the .NET trick: internally, strings use the UTF-16LE encoding (please see the UTF-16 link above). As soon as you consider strings, not characters, everything works right. Some characters use one 16-bit word, but those above the BMP use two such words, a surrogate pair.

Be very careful: the property Length, for example, gives you the number of 16-bit words, not the number of characters. The actual number of characters may be less than the number of 16-bit words.
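To make the difference concrete, here is a hedged sketch that counts both quantities for the same string. LengthDemo is a made-up name; char.IsSurrogatePair(string, int) is the standard .NET method, and the count assumes "character" means Unicode code point.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class LengthDemo
{
    // Number of 16-bit UTF-16 code units, which is what Length reports.
    public static int CodeUnits(string s)
    {
        return s.Length;
    }

    // Number of Unicode code points: a well-formed surrogate pair counts once.
    public static int CodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // Skip the low half of a pair so it is not counted twice.
            if (char.IsSurrogatePair(s, i))
            {
                i++;
            }
            count++;
        }
        return count;
    }
}
```

For the string "A\U00010056", CodeUnits returns 3 while CodePoints returns 2.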

But if you extract a value of the type System.Char, it might not be a character.

Here is what you can do. Whenever you need a character's code point rather than surrogate values, you have to work with strings only. In particular, to get the N-th character from a string, use a substring (the second parameter is the length: 1 for a BMP character, 2 for a surrogate pair): https://msdn.microsoft.com/en-us/library/aka44szs(v=vs.110).aspx

This way, even a single character should be represented as a string, not as an instance of System.Char.

Take the char (System.Char, this time) at index 0 and check whether it is a high surrogate. If it is, take the two char values at indices 0 and 1 and pass them to System.Char.ConvertToUtf32:
https://msdn.microsoft.com/en-us/library/wdh8k14a%28v=vs.110%29.aspx

Alternatively, you can directly get the UTF-32 representation of a character by its index in a string: https://msdn.microsoft.com/en-us/library/z2ys180b%28v=vs.110%29.aspx
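The index-based approach can be sketched as follows, using the standard overload char.ConvertToUtf32(string, int) and formatting each result in the "U+XXXX" style from the question. CodePointReader and ToUPlus are hypothetical names, and passing a lone surrogate through as its raw value is a choice made for this example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, for illustration only.
public static class CodePointReader
{
    // Returns one "U+XXXX" entry per code point in the string.
    public static List<string> ToUPlus(string s)
    {
        var result = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            int codePoint;
            if (char.IsSurrogatePair(s, i))
            {
                // Combine the high and low surrogate into one code point
                // and step over the low surrogate.
                codePoint = char.ConvertToUtf32(s, i);
                i++;
            }
            else
            {
                // A BMP character (or a lone surrogate, passed through as-is).
                codePoint = s[i];
            }
            result.Add("U+" + codePoint.ToString("X4"));
        }
        return result;
    }
}
```

For "A\u03bb\U00010056" this yields U+0041, U+03BB and U+10056.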

Now, you need to know that the numeric value of each 32-bit word in the UTF-32LE encoding is exactly the Unicode code point number. So in .NET, converting to UTF-32 immediately gives you the code point values.
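A short sketch of this equivalence: encode a string with System.Text.Encoding.UTF32 (little-endian) and read each group of four bytes back as an integer. Utf32Demo is a made-up name, and the input is assumed to be well-formed UTF-16, since the default encoder substitutes a replacement character for lone surrogates.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class Utf32Demo
{
    // Returns the code points of a string by way of its UTF-32LE bytes.
    public static int[] CodePoints(string s)
    {
        // Four bytes per code point, least significant byte first.
        byte[] bytes = Encoding.UTF32.GetBytes(s);
        var codePoints = new int[bytes.Length / 4];
        for (int i = 0; i < codePoints.Length; i++)
        {
            // Assemble the little-endian 32-bit word by hand so the result
            // does not depend on the machine's byte order.
            codePoints[i] = bytes[4 * i]
                | (bytes[4 * i + 1] << 8)
                | (bytes[4 * i + 2] << 16)
                | (bytes[4 * i + 3] << 24);
        }
        return codePoints;
    }
}
```

For "A\U00010056" the result is the two values 0x41 and 0x10056. The reverse direction is available as char.ConvertFromUtf32, which turns a code point into a string.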

In all other cases, take the single char and cast it to uint; that will be your code point value.

You can follow all these steps in the C# code sample below. I take some code point values, convert them to a .NET string, then inspect each "real" Unicode character in the string, get its code point, and output the calculated code points:
System.UInt32[] codePoints = new uint[] {
    // above BMP,
    // from: http://www.unicode.org/charts/PDF/U10000.pdf:
    0x10056, 0x10057, 0x10058,
    // Greek labda and mu: in BMP, but outside ASCII:
    0x03bb, 0x03bc,
    // ASCII, Latin A, B:
    0x41, 0x42,
};

// serialize it into array of bytes:
byte[] utf32Data = new byte[codePoints.Length * sizeof(uint)];
for (int index = 0; index < codePoints.Length; ++index) {
    byte[] character = System.BitConverter.GetBytes(codePoints[index]);
    System.Array.Copy(character, 0, utf32Data, index * sizeof(uint), character.Length);
}

// get string out of UTF32 data:
string value = new string(System.Text.Encoding.UTF32.GetChars(utf32Data));

// calculate and output code points:
System.Text.StringBuilder sb = new System.Text.StringBuilder("code points: ");
for (int index = 0; index < value.Length; ++index) {
    char[] character; // one or two 16-bit words is a character
    char word = value[index]; // a 16-bit word, not really a character
    if (System.Char.IsHighSurrogate(word)) {
        character = new char[] { word, value[index + 1], };
    } else if (System.Char.IsLowSurrogate(word))
        continue;
    else
        character = new char[] { word };
    int codePoint;
    if (character.Length > 1)
        codePoint = System.Char.ConvertToUtf32(character[0], character[1]);
    else
        codePoint = (int)character[0];
    sb.Append(string.Format("{0:x} ", codePoint));
}
System.Console.WriteLine(sb.ToString());



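This code sample is not efficient. Instead, I tried to show each step clearly.

[END EDIT]

You should understand Unicode code points correctly: they are just ordinal numbers assigned to characters. They are values in the purely mathematical sense, completely abstracted from any computer representation. Likewise, characters are purely cultural entities, completely abstracted from computer representation and from details such as glyphs, fonts, and so on. This correspondence between characters and numbers is what Unicode codifies. All the computer-related details are defined by the UTFs.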



-SA

