Detect encoding of byte array


Problem description






Hi,

I am stuck in a very big issue related to encoding. I have a column in the database which stores XML as well as plain strings. Previously they were stored in UTF-8 format, but now it has been changed to UTF-16, so we need to read both the old and the new data, which are in different formats. I need a way to find the encoding of a byte array. Please provide a solution as soon as possible; it is very urgent.

Thanks
Akanksha

Recommended answer

Strictly speaking, there is no regular, 100% certain way to tell the UTF from the encoded array of bytes. You have made a fatal mistake and increased the entropy of the system. This mistake is theoretically irreversible, in the same sense that the entropy of a closed system cannot be reduced.

The serialized Unicode string can be represented as two components: an array of bytes, which can be obtained from System.Text.Encoding.GetBytes(string), and the information about the encoding itself. You can think of this piece of information as a reference to a concrete run-time Encoding class. Please see:
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx
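
To illustrate the point, here is a minimal C# sketch (assuming a .NET environment, since System.Text.Encoding is already in play; the sample string is made up). The same string serialized with two different Encoding instances gives two different byte arrays, and neither array records which encoding produced it:

using System;
using System.Text;

class EncodingLossDemo
{
    static void Main()
    {
        string text = "<root>sample пример</root>";

        // Two serializations of the same string; the byte arrays differ,
        // but neither one carries any record of the encoding used.
        byte[] utf8Bytes  = Encoding.UTF8.GetBytes(text);
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(text); // UTF-16LE

        Console.WriteLine("UTF-8 length:    " + utf8Bytes.Length);
        Console.WriteLine("UTF-16LE length: " + utf16Bytes.Length);

        // Round-tripping works only while you still know which encoding
        // was used for serialization:
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes));
        Console.WriteLine(Encoding.Unicode.GetString(utf16Bytes));
    }
}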

The part of the Unicode standard dedicated to UTFs suggests a standard mechanism for keeping the encoding information with the string data: a certain number of bytes, called a BOM (Byte Order Mark), different for each UTF, which allows unambiguous detection of the UTF encoding. Please see:
http://en.wikipedia.org/wiki/Byte_order_mark,
http://unicode.org/faq/utf_bom.html.
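
For reference, a minimal sketch of a BOM check in C# for the UTFs discussed above; the class and method names are mine, and a null return means no BOM was found, which is exactly the situation where the bytes alone cannot identify the encoding:

using System.Text;

static class BomSniffer
{
    // Returns the encoding indicated by a leading BOM, or null if there is
    // no BOM. Note that UTF-32LE must be tested before UTF-16LE, because
    // its BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE).
    public static Encoding DetectByBom(byte[] data)
    {
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;
        if (data.Length >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
            return Encoding.UTF32;                                          // UTF-32LE
        if (data.Length >= 4 && data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
            return new UTF32Encoding(bigEndian: true, byteOrderMark: true); // UTF-32BE
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;                                        // UTF-16LE
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode;                               // UTF-16BE
        return null; // no BOM: the encoding has to be guessed by other means
    }
}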

It looks like you failed to use this or any other such mechanism, with fatal consequences. This is the second case of a failure like that I have come across at CodeProject.

I can see some ways to fix it, but it will take some labor. First, a trained human eye can easily detect the encoding just by looking at the bytes rendered in some way. And if the same array of bytes is deserialized into text using two or more different encodings, anyone familiar with the writing system (language) can tell in no time which encoding was correct.

You can automate this process. To do that, you should have a dictionary (or dictionaries) of the languages used in the text and perform a statistical analysis of the text deserialized with different hypothetical encodings. The right encoding would be the one which yields more matches between the text's lexemes and the dictionary entries. You will need to do some research to confirm the validity of the decisions at different match levels (say, percentage of matches), using the judgement of a human operator. When this is done and a confidence level is established, you will need to pass the whole database through this system. In all questionable cases (you should develop the criterion for certain vs. questionable cases in your experimental research), the final decision should be made by a human operator. The problem will be solved when all the text is converted to a single UTF. Alternatively, a BOM could be used, but I would recommend UTF-8 in all cases as the result of the fix.
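
Here is a rough sketch of that idea, assuming the candidate encodings are UTF-8 and UTF-16LE and that you can supply a word list for the dominant language; the lexicon, the tokenization and any acceptance threshold are placeholders that would have to come out of the research described above:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class DictionaryGuesser
{
    // Decodes the byte array with each candidate encoding and scores the
    // result by the fraction of tokens found in the lexicon. The encoding
    // with the best score wins; low scores should be routed to a human
    // operator as questionable cases.
    public static Encoding GuessEncoding(byte[] data, ISet<string> lexicon,
                                         out double bestScore)
    {
        var candidates = new[] { Encoding.UTF8, Encoding.Unicode }; // UTF-8, UTF-16LE
        var separators = new[] { ' ', '\t', '\r', '\n', '<', '>', '/', '"', '=', '.', ',' };

        Encoding best = null;
        bestScore = -1.0;

        foreach (var enc in candidates)
        {
            string text = enc.GetString(data);
            string[] words = text.Split(separators, StringSplitOptions.RemoveEmptyEntries);
            if (words.Length == 0) continue;

            double score = words.Count(w => lexicon.Contains(w.ToLowerInvariant()))
                           / (double)words.Length;
            if (score > bestScore)
            {
                bestScore = score;
                best = enc;
            }
        }
        return best;
    }
}

Rows whose best score falls below the experimentally established threshold would then be queued for review by a human operator.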

You would need to be much more careful with fusional or agglutinative languages, because in this case you would need to be able to extract the roots or other lexical units of a word, its morphemes (for comparison with a dictionary), which is, in the general case, a serious task in both linguistics and computing. Please see:
http://en.wikipedia.org/wiki/Agglutinative_language,
http://en.wikipedia.org/wiki/Fusional_language,
http://en.wikipedia.org/wiki/Word_root,
http://en.wikipedia.org/wiki/Morpheme.



Actually, there is one more, simpler statistical criterion which should work pretty well in many situations, provided you use either UTF-8 or UTF-16LE.

It will work well if most of the code points fall into one, two or three Unicode sub-ranges, which usually happens when there is one dominating language in the text. Typically, many characters have code points within ASCII, and most of the others fall into the same Unicode sub-range. In UTF-16LE the sub-range is indicated by the high byte of each 16-bit word (32-bit characters beyond the BMP are rare, so I don't consider them). Therefore, if you look only at the distribution of the high bytes, they will fall into one or two main modes, more rarely three or more. So if you find one dominant mode of, say, 30% or more (with others around 10% or 20%), the data is more likely UTF-16LE than UTF-8, where you will also see modes in the distribution, but more of them and less distinct.
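
A sketch of that check follows; the 30% figure is only the example number from above and would need to be calibrated on your actual data:

using System.Linq;

static class HighByteHeuristic
{
    // Looks at the bytes sitting at odd offsets, which in UTF-16LE are the
    // high bytes of the 16-bit code units, and checks whether one value
    // dominates the distribution (e.g. 0x00 for ASCII-heavy text, or the
    // sub-range byte of the dominant script).
    public static bool LooksLikeUtf16Le(byte[] data, double dominantShare = 0.30)
    {
        if (data.Length < 2) return false;

        var counts = new int[256];
        int total = 0;
        for (int i = 1; i < data.Length; i += 2)
        {
            counts[data[i]]++;
            total++;
        }

        // In UTF-8 the same positions hold ordinary text bytes, so no
        // single value tends to reach such a large share of the distribution.
        double topShare = counts.Max() / (double)total;
        return topShare >= dominantShare;
    }
}

Combined with the dictionary check above, this gives a cheap first-pass filter before anything is sent to a human operator.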

This trick won't work well on languages based on logograms (like Chinese or Korean), but it should show good results on most other languages: Western European, Slavic, Ugric, Georgian, Armenian, Arabo-Persian and of course the Brahmic scripts (most of the numerous Indian writing systems, Thai, etc.), and many more.

[END EDIT]

Overall, the creation of this software could be done quite quickly, but the research and the final conversion of the data could be more or less expensive. Next time, use your head before doing the work.

—SA


How about discriminating UTF-8/16 based on the table entry date?
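
For instance, a minimal sketch of that idea, assuming the table has an entry-date column and the date of the switch to UTF-16 is known (both are assumptions about your schema):

using System;
using System.Text;

static class CutoverGuess
{
    // Rows written before the format change are assumed to be UTF-8,
    // later rows UTF-16LE. The cutover date and the column it is compared
    // against are assumptions about the database in question.
    public static Encoding EncodingFor(DateTime rowEntryDate, DateTime cutoverDate)
    {
        return rowEntryDate < cutoverDate ? Encoding.UTF8 : Encoding.Unicode;
    }
}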


