Guessing the encoding of text represented as byte[] in Java

Published: 2016/11/19 12:39:33 · Tags: java, encoding, utf-8, character-encoding


Question

Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most likely encoding used (in Java)?

Worth noting:

No additional meta-data is available. The byte array is literally the only available input.
The detection algorithm will obviously not be 100% correct. If the algorithm is correct in more than, say, 80% of the cases, that is good enough.

Solution

The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

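public static String guessEncoding(byte[] bytes) {
    // Fallback used when the detector cannot make a guess.
    String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    // Feed the whole array to the detector, then signal the end of the input.
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    // getDetectedCharset() returns null if no encoding could be determined.
    String encoding = detector.getDetectedCharset();
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}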

The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.
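
Once a charset name has been guessed, the bytes still have to be decoded with it. The sketch below shows one way this could be done with the standard library only. The class and method names (GuessedTextDecoder, resolveCharset, decode) are hypothetical and not part of juniversalchardet. It assumes that a detector may return a name the running JVM does not support, in which case it falls back to UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: turns a guessed charset name into decoded text.
final class GuessedTextDecoder {
    private GuessedTextDecoder() {}

    // Resolves a charset name, falling back to UTF-8 when the name is
    // null, syntactically illegal, or not supported by this JVM.
    static Charset resolveCharset(String charsetName) {
        if (charsetName == null) {
            return StandardCharsets.UTF_8;
        }
        try {
            if (Charset.isSupported(charsetName)) {
                return Charset.forName(charsetName);
            }
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the bytes with the resolved charset. Malformed input is
    // replaced by the charset's default replacement string.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, resolveCharset(charsetName));
    }
}
```

With the method from the answer, a call could look like GuessedTextDecoder.decode(bytes, guessEncoding(bytes)).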

I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.
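
For the common case mentioned in the question (UTF-8 versus ISO-8859-1), a library-free heuristic is also possible. The sketch below is not what juniversalchardet does. It is a different and much cruder technique: it checks whether the bytes are well-formed UTF-8 using a strict CharsetDecoder and otherwise assumes ISO-8859-1. The class name Utf8OrLatin1Guesser is hypothetical. The fallback is an assumption, since every byte sequence is valid ISO-8859-1, so the check cannot tell it apart from other single-byte encodings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical two-way guesser: strict UTF-8 validation with a Latin-1 fallback.
final class Utf8OrLatin1Guesser {
    private Utf8OrLatin1Guesser() {}

    // Returns true if the whole array decodes as UTF-8 without errors.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // decode(ByteBuffer) treats the buffer as the complete input,
            // so a truncated multi-byte sequence at the end is an error too.
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Pure ASCII input is valid UTF-8, so it is reported as "UTF-8".
    static String guess(byte[] bytes) {
        return isValidUtf8(bytes) ? "UTF-8" : "ISO-8859-1";
    }
}
```

This works reasonably well in practice because non-ASCII text in a single-byte encoding rarely happens to form valid UTF-8 multi-byte sequences, but it makes no attempt to distinguish among the many other encodings a full detector handles.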
