What is the most accurate encoding detector?

Views: 319 Published: 2016/11/19 12:42:54 Tags: java character-encoding

Question

After some research, I found that there are a few encoding detection projects in the Java world that can be used if getEncoding in InputStreamReader does not work:

juniversalchardet
jchardet
cpdetector
ICU4J

However, I really do not know which one is the best. Can anyone with hands-on experience tell me which one is the best in Java?

Recommended answer

I've tested juniversalchardet and ICU4J on some CSV files, and the results were inconsistent. juniversalchardet had better results:

UTF-8: both detected it.
Windows-1255: juniversalchardet detected it once the text had enough Hebrew letters, while ICU4J still thought it was ISO-8859-1. With even more Hebrew letters, ICU4J detected it as ISO-8859-8, which is the other Hebrew encoding (and so the text was OK).
SHIFT_JIS (Japanese): juniversalchardet detected it, while ICU4J thought it was ISO-8859-2.
ISO-8859-1: detected by ICU4J, not supported by juniversalchardet.
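
One likely reason single-byte encodings get confused in results like these is that ISO-8859-1 assigns a character to every byte value, so any input decodes without error under it, and a detector has to rely on statistics rather than validity. Below is a minimal JDK-only sketch of that point. The class name AmbiguousBytes and the helper decodesCleanly are illustrative names, not part of either library, and the four bytes are the Hebrew word "shalom" as Windows-1255 encodes it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class AmbiguousBytes {
    // "shalom" in Hebrew letters (shin, lamed, vav, final mem) as Windows-1255 bytes.
    static final byte[] HEBREW = {(byte) 0xF9, (byte) 0xEC, (byte) 0xE5, (byte) 0xED};

    /** Returns true if the bytes decode under the charset with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Strict UTF-8 rejects these bytes, so UTF-8 can be ruled out by validity alone.
        System.out.println(decodesCleanly(HEBREW, StandardCharsets.UTF_8));      // false
        // ISO-8859-1 accepts them (as accented Latin letters), so validity cannot rule it out.
        System.out.println(decodesCleanly(HEBREW, StandardCharsets.ISO_8859_1)); // true
    }
}
```

With only a few such bytes in a mostly ASCII CSV file, a statistical detector has little evidence to prefer a Hebrew encoding over ISO-8859-1, which matches the "enough Hebrew letters" observation above.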

So you should consider which encodings you will most likely have to deal with. In the end I chose ICU4J.

Note that ICU4J is still maintained.

Also note that you may want to use ICU4J first and, if it returns null because detection did not succeed, fall back to juniversalchardet. Or the other way around.
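
The fallback idea can be sketched as a chain that asks each detector in turn and returns the first answer. This is only an illustration under stated assumptions: the EncodingDetector interface, the class FallbackDetection, and the two stand-in detectors (a byte order mark check and a strict UTF-8 validity check) are hypothetical and use only the JDK. They are not the ICU4J or juniversalchardet APIs; in real use each library would be wrapped behind the interface, returning an empty result where the library reports no match.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class FallbackDetection {
    /** Hypothetical contract: an empty result means "could not decide". */
    interface EncodingDetector {
        Optional<String> detect(byte[] data);
    }

    /** Stand-in detector 1: recognizes only a leading byte order mark. */
    static final EncodingDetector BOM = data -> {
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return Optional.of("UTF-8");
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return Optional.of("UTF-16BE");
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return Optional.of("UTF-16LE");
        }
        return Optional.empty();
    };

    /** Stand-in detector 2: reports UTF-8 only if the input is strictly valid UTF-8. */
    static final EncodingDetector STRICT_UTF8 = data -> {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return Optional.of("UTF-8");
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    };

    /** Order matters: earlier detectors win, later ones are only fallbacks. */
    static final List<EncodingDetector> DEFAULT_CHAIN = Arrays.asList(BOM, STRICT_UTF8);

    /** Returns the first non-empty answer, or empty if every detector gives up. */
    static Optional<String> detectFirst(byte[] data, List<EncodingDetector> detectors) {
        for (EncodingDetector detector : detectors) {
            Optional<String> result = detector.detect(data);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        byte[] utf8NoBom = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // The BOM check gives up, so the strict UTF-8 check answers.
        System.out.println(detectFirst(utf8NoBom, DEFAULT_CHAIN));
    }
}
```

Putting the detector you trust most for your expected encodings first keeps its answer whenever it has one, which is the trade-off between "ICU4J first" and "juniversalchardet first" described above.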

AutoDetectReader in Apache Tika does exactly this: it first tries HtmlEncodingDetector, then UniversalEncodingDetector (which is based on juniversalchardet), and then Icu4jEncodingDetector (based on ICU4J).
