Identify single-byte and multibyte characters


Problem description

Hello,

I am Lakshman.

I have a string array of hexadecimal values.
For example: {0x0, 0x31, 0xef, .....}

I want to identify single-byte and multibyte characters.

How can I identify them?

Thanks in advance.

Regards,

Lakshman

Recommended answer

There is no "multibyte character" concept in .NET. .NET characters use the UTF-16LE Unicode encoding. This encoding is built from 16-bit code units. A single unit is not enough to represent all Unicode code points, only those of the BMP (Basic Multilingual Plane), which covers code points from 0 to 0xFFFF inclusive.

What about the other Unicode code points, above the BMP? They are supported only at the level of strings, not individual characters. This is pretty hard to explain. In UTF-16 encoding, such code points are represented by surrogate pairs. Only the UTF-16 encodings use them, but there is a special range of code points (0xD800 to 0xDFFF) dedicated to surrogates, which should not be used for any "real" code points. Each 16-bit word of a surrogate pair is not a real code point; it only occupies the position of a code point that does not exist as a real character. At the level of strings, a pair is treated as a single character (for example, correctly rendered on screen as one glyph). This support was introduced in one of the Windows 2000 service packs.
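To make the surrogate-pair arithmetic concrete, here is a minimal sketch. It is written in Python purely for brevity and is not .NET code; the helper names `to_surrogates` and `from_surrogates` are made up for this illustration. It splits a code point above 0xFFFF into a high and a low surrogate and joins them back, following the UTF-16 definition.

```python
def to_surrogates(code_point):
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low


def from_surrogates(high, low):
    """Join a high/low surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1F600 is just an example of a code point outside the BMP.
high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))               # 0xd83d 0xde00
print(hex(from_surrogates(high, low)))   # 0x1f600
```

Neither 0xD83D nor 0xDE00 means anything on its own; only the pair identifies a character.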

At the level of characters, there are no 4-byte characters. This means that if you traverse a string character by character, some characters may not be "real" characters, so such traversal should be avoided, at least for languages that use above-BMP code points. You should not break surrogate pairs, but the char type lets you do so, which can cause problems.
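One way to see which 16-bit units are whole characters and which are halves of a pair is to walk the units and group them. The sketch below is Python, not .NET, and `group_utf16_units` is a hypothetical helper written for this example; in .NET the corresponding per-unit checks are `char.IsHighSurrogate` and `char.IsLowSurrogate`.

```python
def group_utf16_units(units):
    """Group 16-bit UTF-16 code units into characters.

    Returns a list of tuples: one unit for a BMP character,
    two units for a surrogate pair. Unpaired surrogates raise ValueError.
    """
    groups = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:      # high surrogate: needs a low one next
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError(f"unpaired high surrogate at index {i}")
            groups.append((unit, units[i + 1]))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:    # low surrogate with no high one before it
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:                             # ordinary BMP character
            groups.append((unit,))
            i += 1
    return groups


# 'A', then U+1F600 as a surrogate pair, then 'B' (made-up sample data).
units = [0x0041, 0xD83D, 0xDE00, 0x0042]
for group in group_utf16_units(units):
    kind = "surrogate pair (4 bytes)" if len(group) == 2 else "single unit (2 bytes)"
    print([hex(u) for u in group], kind)
```

Grouping first and processing whole groups afterwards avoids ever handling half of a pair by accident.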

To address these problems, the type System.Text.Encoding was designed. You can serialize any string to an array of bytes (not characters!) using the GetBytes methods, or deserialize an array of bytes into a string using the GetString method.
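As a rough illustration of this round trip, here is the same idea in Python rather than .NET (where the calls would be `Encoding.Unicode.GetBytes` and `Encoding.Unicode.GetString`). The sample string is made up for the example.

```python
text = "A\U0001F600"    # 'A' plus one example character above the BMP

# Serialize the string to bytes as little-endian UTF-16.
data = text.encode("utf-16-le")
print(len(data))        # 6: 2 bytes for 'A', 4 bytes for the surrogate pair

# Deserialize the same bytes back into a string.
restored = data.decode("utf-16-le")
print(restored == text) # True
```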

To solve your problem, you need to know what your array represents. What you show is not a "string array"; it is an integer array. For integers, there is no such thing as "hex" or "decimal". If this is an array of 16-bit integers and each element represents a character or a surrogate, you can probably serialize it into an array of bytes and then deserialize that into a string using System.Text.Encoding.GetString. Correct serialization depends on the endianness (little-endian or big-endian) of the array. Where did you get it? Does it represent any sensible string? You can try it anyway. If you run into a problem, post a valid data sample and I'll take a look.
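A small sketch of why the byte order matters, again in Python rather than .NET and using made-up sample values instead of the asker's data: the same bytes decode to different text depending on the endianness assumed.

```python
import struct

# Made-up sample of 16-bit values; not the asker's actual data.
units = [0x0048, 0x0069]

# Serialize the integers to bytes, low byte first (little-endian).
raw = struct.pack(f"<{len(units)}H", *units)

# Decoding with the matching byte order recovers the text.
as_little = raw.decode("utf-16-le")
print(as_little)                        # Hi

# Decoding with the wrong byte order silently yields other characters.
as_big = raw.decode("utf-16-be")
print([hex(ord(c)) for c in as_big])    # ['0x4800', '0x6900']
```

No error is raised in the mismatched case, so the only reliable check is whether the decoded string makes sense for the data source.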

See:
http://unicode.org/
http://unicode.org/faq/utf_bom.html
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx

—SA

