What is the most accurate encoding detector?


Question

After some research, I found that there are a few encoding detection projects in the Java world that can be used when getEncoding in InputStreamReader does not work:


  1. juniversalchardet
  2. jchardet
  3. cpdetector
  4. ICU4J

However, I really do not know which of these is the best. Can anyone with hands-on experience tell me which one is the best for Java?

Recommended answer

I tested juniversalchardet and ICU4J on some CSV files, and the two did not always agree. juniversalchardet had the better results:


  • UTF-8: both detected it.
  • Windows-1255: juniversalchardet detected it once the text had enough Hebrew letters, while ICU4J still thought it was ISO-8859-1. With even more Hebrew letters, ICU4J detected it as ISO-8859-8, which is the other Hebrew encoding (so the text still came out OK).
  • SHIFT_JIS (Japanese): juniversalchardet detected it; ICU4J thought it was ISO-8859-2.
  • ISO-8859-1: detected by ICU4J; not supported by juniversalchardet.

So you should consider which encodings you are most likely to have to deal with. In the end I chose ICU4J.

Note that ICU4J is still maintained.

Also note that you may want to use ICU4J first and, if it returns null because detection did not succeed, fall back to juniversalchardet. Or the other way round.
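The "try one detector, fall back to the next if it returns null" idea can be sketched with the standard library alone. This is only an illustration of the chaining pattern, not the API of ICU4J or juniversalchardet: the class name FallbackCharsetDetection and the two stand-in detectors (a UTF-8 byte order mark check and a strict UTF-8 decode) are assumptions made up for this example. In real use, each entry in the list would wrap one of the libraries mentioned above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;

public class FallbackCharsetDetection {

    /** Tries each detector in order and returns the first non-null charset name, or null. */
    public static String detectWithFallback(byte[] data, List<Function<byte[], String>> detectors) {
        for (Function<byte[], String> detector : detectors) {
            String name = detector.apply(data);
            if (name != null) {
                return name;
            }
        }
        return null; // every detector gave up
    }

    /** Hypothetical stand-in detector: recognises only a UTF-8 byte order mark. */
    public static String detectUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? "UTF-8" : null;
    }

    /** Hypothetical stand-in detector: answers "UTF-8" only if the bytes decode strictly as UTF-8. */
    public static String detectStrictUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return null; // not valid UTF-8, let the next detector try
        }
    }

    public static void main(String[] args) {
        // Order matters: the first detector that returns non-null wins.
        List<Function<byte[], String>> chain = List.of(
                FallbackCharsetDetection::detectUtf8Bom,
                FallbackCharsetDetection::detectStrictUtf8);

        // Hebrew text encoded as UTF-8 (no BOM), so the second detector answers.
        byte[] hebrewUtf8 = "\u05E9\u05DC\u05D5\u05DD".getBytes(StandardCharsets.UTF_8);
        // Three bytes that are Hebrew letters in Windows-1255 but malformed as UTF-8.
        byte[] notUtf8 = {(byte) 0xE0, (byte) 0xE1, (byte) 0xE2};

        System.out.println(detectWithFallback(hebrewUtf8, chain));
        System.out.println(detectWithFallback(notUtf8, chain));
    }
}
```

With these stand-ins the second input falls through both detectors and yields null, which is the point where a further detector (or a default charset) would be added to the chain.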

AutoDetectReader in Apache Tika does exactly this: it first tries HtmlEncodingDetector, then UniversalEncodingDetector (which is based on juniversalchardet), and then Icu4jEncodingDetector (based on ICU4J).
