Character Encoding Detection Algorithm


Problem description



I'm looking for a way to detect character sets within documents. I've been reading the Mozilla character set detection implementation here:

Universal Charset Detection

I've also found a Java implementation of this called jCharDet:

JCharDet

Both of these are based on research carried out using a set of static data. What I'm wondering is whether anybody has used any other implementation successfully, and if so, which? Did you roll your own approach, and if so, what algorithm did you use to detect the character set?

Any help would be appreciated. I'm not looking for a list of existing approaches via Google, nor am I looking for a link to the Joel Spolsky article - just to clarify : )

UPDATE: I did a bunch of research into this and ended up finding a framework called cpdetector that uses a pluggable approach to character detection, see:

CPDetector

This provides BOM, chardet (Mozilla approach) and ASCII detection plugins. It's also very easy to write your own. There's also another framework, which provides much better character detection than the Mozilla approach/jchardet etc...

ICU4J

It's quite easy to write your own plugin for cpdetector that uses this framework to provide a more accurate character encoding detection algorithm. It works better than the Mozilla approach.
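For reference, here is a minimal sketch of the ICU4J detection that such a plugin would wrap (CharsetDetector and CharsetMatch live in the com.ibm.icu.text package; the surrounding class and sample input are my own illustration, so verify the API against your ICU4J version):

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.nio.charset.StandardCharsets;

public class Icu4jDetectionSketch {
    public static void main(String[] args) {
        // Sample input: Japanese text encoded as UTF-8.
        byte[] data = "こんにちは、世界".getBytes(StandardCharsets.UTF_8);

        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);

        // detect() returns the best match; detectAll() returns all
        // candidates, ordered by confidence (0-100).
        CharsetMatch match = detector.detect();
        if (match != null) {
            System.out.println(match.getName()
                    + " (confidence " + match.getConfidence() + ")");
        }
    }
}
```

A cpdetector plugin would run this inside its detectCodepage method and map the resulting name to a java.nio.charset.Charset; check cpdetector's ICodepageDetector interface for the exact signature.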

Solution

Years ago we had character set detection for a mail application, and we rolled our own. The mail app was actually a WAP application, and the phone expected UTF-8. There were several steps:

Universal

We could easily detect if text was UTF-8, as there is a specific bit pattern in the top bits of bytes 2/3/etc. of a multi-byte sequence: every continuation byte starts with the bits 10, and the lead byte's top bits encode how many continuation bytes follow. Once you found that pattern repeated a certain number of times you could be certain it was UTF-8.
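As a rough illustration (a sketch, not the mail application's actual code): lead bytes of the form 110xxxxx, 1110xxxx or 11110xxx must each be followed by the matching number of 10xxxxxx continuation bytes, so a validity check plus a sequence counter is enough:

```java
/** Sketch: true if data is valid UTF-8 and contains at least
 *  minSequences multi-byte sequences (pure ASCII is inconclusive). */
static boolean looksLikeUtf8(byte[] data, int minSequences) {
    int sequences = 0;
    for (int i = 0; i < data.length; ) {
        int b = data[i] & 0xFF;
        int continuations;
        if (b < 0x80) { i++; continue; }                // 0xxxxxxx: ASCII
        else if ((b & 0xE0) == 0xC0) continuations = 1; // 110xxxxx
        else if ((b & 0xF0) == 0xE0) continuations = 2; // 1110xxxx
        else if ((b & 0xF8) == 0xF0) continuations = 3; // 11110xxx
        else return false;                              // invalid lead byte
        for (int j = 1; j <= continuations; j++) {
            // every continuation byte must match 10xxxxxx
            if (i + j >= data.length || (data[i + j] & 0xC0) != 0x80) return false;
        }
        sequences++;
        i += 1 + continuations;
    }
    return sequences >= minSequences;
}
```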

If the file begins with a UTF-16 byte order mark, you can probably assume the rest of the text is that encoding. Otherwise, detecting UTF-16 isn't nearly as easy as UTF-8, unless you can detect the surrogate pairs pattern: but the use of surrogate pairs is rare, so that doesn't usually work. UTF-32 is similar, except there are no surrogate pairs to detect.
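A sketch of that BOM check (my illustration; the order of the tests matters, because the UTF-32LE BOM FF FE 00 00 begins with the UTF-16LE BOM FF FE):

```java
/** Sketch: returns the charset implied by a byte order mark, or null. */
static String charsetFromBom(byte[] d) {
    // Test UTF-32 before UTF-16: FF FE 00 00 starts with FF FE.
    if (startsWith(d, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
    if (startsWith(d, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
    if (startsWith(d, 0xEF, 0xBB, 0xBF))       return "UTF-8";
    if (startsWith(d, 0xFE, 0xFF))             return "UTF-16BE";
    if (startsWith(d, 0xFF, 0xFE))             return "UTF-16LE";
    return null;
}

static boolean startsWith(byte[] d, int... prefix) {
    if (d.length < prefix.length) return false;
    for (int i = 0; i < prefix.length; i++) {
        if ((d[i] & 0xFF) != prefix[i]) return false;
    }
    return true;
}
```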

Regional detection

Next we would assume the reader was in a certain region. For instance, if the user was seeing the UI localized in Japanese, we could then attempt detection of the three main Japanese encodings. ISO-2022-JP is again easy to detect with the escape sequences. If that fails, determining the difference between EUC-JP and Shift-JIS is not as straightforward. It's more likely that a user would receive Shift-JIS text, but there were characters in EUC-JP that didn't exist in Shift-JIS, and vice-versa, so sometimes you could get a good match.
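The escape-sequence check is straightforward, since ISO-2022-JP is a 7-bit encoding that announces every character-set switch; a sketch (my illustration, not the original code):

```java
/** Sketch: ISO-2022-JP switches character sets with sequences like
 *  ESC $ B / ESC $ @ (to JIS X 0208) and ESC ( B / ESC ( J (back to
 *  ASCII / JIS X 0201 Roman). Seeing one is a strong signal. */
static boolean looksLikeIso2022Jp(byte[] d) {
    for (int i = 0; i + 2 < d.length; i++) {
        if (d[i] != 0x1B) continue;  // ESC
        int b1 = d[i + 1] & 0xFF, b2 = d[i + 2] & 0xFF;
        if ((b1 == '$' && (b2 == 'B' || b2 == '@'))
                || (b1 == '(' && (b2 == 'B' || b2 == 'J'))) {
            return true;
        }
    }
    return false;
}
```

Telling EUC-JP and Shift-JIS apart, by contrast, usually comes down to decoding the bytes both ways and scoring which interpretation stays within valid character ranges.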

The same procedure was used for Chinese encodings and other regions.

User's choice

If these didn't provide satisfactory results, the user had to choose an encoding manually.
