Detect UTF-8 double-byte characters


Problem description

I'm seriously struggling to parse a line from a UTF-8 file into an array of strings.
My file has content:

Gȯrecki   John
12345678901234


The first name "John" starts at the 10th position. (Here the 2nd character is UTF-8 U+022F.)
In code I need to do

LineRead.Substring(11,4)

to get "John", where it should be

LineRead.Substring(10,4)


with normal characters.

My question is of course how to detect that I need to do a Substring of 11 instead of 10, in this case?
I tried things like

If Not System.Text.Encoding.UTF8.GetCharCount(System.Text.Encoding.UTF8.GetBytes(LineRead)) = System.Text.Encoding.UTF8.GetByteCount(LineRead) Then 

but that's also the case for "à", which counts as only 1 in String.Length but has 2 bytes in UTF-8...

How to handle common cases like this?
How to prevent splitting up bytes of 1 character into several wrong characters? That way I could progress through the string character by character and count them?
Thanks in advance!

Solution

After hours of looking and finally desperately posting it on here, I think I found the solution at http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
Thanks anyway though!
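
For readers landing here, this is a minimal sketch of how StringInfo could be applied to the line from the question. It rests on an assumption the thread does not confirm: that the file stores the second letter in decomposed form, as "o" followed by U+0307 (COMBINING DOT ABOVE), which would explain why String.Length counts one character more than the eye sees. The module name and the SubstringByElements helper are made up for illustration.

```vbnet
Imports System
Imports System.Globalization
Imports Microsoft.VisualBasic

Module TextElementDemo
    ' Hypothetical helper: takes a substring measured in text elements
    ' (user-perceived characters) instead of UTF-16 code units.
    Function SubstringByElements(line As String, start As Integer, length As Integer) As String
        Dim info As New StringInfo(line)
        Return info.SubstringByTextElements(start, length)
    End Function

    Sub Main()
        ' ASSUMPTION: "o" + U+0307 (decomposed), not the precomposed U+022F.
        Dim lineRead As String = "Go" & ChrW(&H307) & "recki   John"

        Console.WriteLine(lineRead.Length)                                 ' 15 code units
        Console.WriteLine(New StringInfo(lineRead).LengthInTextElements)   ' 14 text elements

        Console.WriteLine(lineRead.Substring(11, 4))                       ' John (code-unit index)
        Console.WriteLine(SubstringByElements(lineRead, 10, 4))            ' John (text-element index)
    End Sub
End Module
```

If the data really is decomposed, calling lineRead.Normalize() before indexing is another option worth trying, since the default form (NFC) should compose the pair back into a single character.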


Well, it sounds like you may already have a solution, but here are my thoughts, for what it's worth.

In .NET, strings are stored as sequences of 2-byte UTF-16 code units, the idea being that you can fit just about any set of letters into 16 bits. As such, it shouldn't make any difference if you have non-ASCII characters in your string; indexing will work as expected, except for combining sequences and characters outside the Basic Multilingual Plane, which take more than one code unit.

It sounds like when you are reading the file, .NET is interpreting it as ASCII and thus creating a new 2-byte character for every single byte in the file. That is fine unless you have a double-byte UTF-8 character: it will interpret this as two characters and create 4 bytes, and indexing will be out by 1.

The fact that your indexing is out implies that this is the case, and also the surname will be incorrect.

I believe that text files often start with a small header (a byte order mark) denoting the file encoding. StreamReader falls back to UTF-8 when that header is missing, but if the file is opened with a single-byte encoding such as Encoding.ASCII, every byte is decoded as a separate character. So, when you load the file (using StreamReader or whatever), explicitly tell it that the file is UTF-8 encoded. The issue should then just disappear.
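
To make the effect of the decoding step concrete, here is a small sketch that decodes the same UTF-8 bytes twice, once as UTF-8 and once as ASCII. It assumes the line holds the precomposed U+022F, as stated in the question. The DecodeLine helper is hypothetical and uses a MemoryStream in place of a real file so the example runs on its own; with a file you would pass the path and the encoding to the StreamReader constructor instead.

```vbnet
Imports System
Imports System.IO
Imports System.Text
Imports Microsoft.VisualBasic

Module DecodingDemo
    ' Hypothetical helper: reads the first line from raw bytes with the given
    ' encoding, the way a StreamReader opened on a file would.
    Function DecodeLine(data As Byte(), enc As Encoding) As String
        Using reader As New StreamReader(New MemoryStream(data), enc)
            Return reader.ReadLine()
        End Using
    End Function

    Sub Main()
        ' The sample line as UTF-8 bytes (GetBytes does not emit a byte order mark).
        Dim raw As Byte() = Encoding.UTF8.GetBytes("G" & ChrW(&H22F) & "recki   John")

        Dim asUtf8 As String = DecodeLine(raw, Encoding.UTF8)
        Dim asAscii As String = DecodeLine(raw, Encoding.ASCII)

        Console.WriteLine(asUtf8.Length)             ' 14
        Console.WriteLine(asUtf8.Substring(10, 4))   ' John
        Console.WriteLine(asAscii.Length)            ' 15: the two bytes of U+022F became two "?" characters
        Console.WriteLine(asAscii.Substring(11, 4))  ' John, shifted by one
    End Sub
End Module
```

With the wrong decoder the offset shifts by one and the surname is damaged, which matches the symptom described above.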

