Guessing the encoding of text represented as byte[] in Java

Problem description

Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most likely encoding used (in Java)?

Worth noting:

  • No additional meta-data is available. The byte array is literally the only available input.
  • The detection algorithm will obviously not be 100% correct. If it is correct in more than, say, 80% of cases, that is good enough.

Solution

The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

public static String guessEncoding(byte[] bytes) {
    // Returned when the detector cannot decide on a charset.
    String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    // Feed the whole array to the detector, then signal end of input.
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    // null means no charset could be detected.
    String encoding = detector.getDetectedCharset();
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}

The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.
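
The method returns only a charset name, so the bytes still have to be decoded with it. The following is a minimal sketch of that step, using only the JDK. The class and method names (GuessedCharsetDecoding, decode) are hypothetical and not part of juniversalchardet. It assumes that a detector may report a name the running JVM does not support, and in that case it falls back to UTF-8, mirroring the default used above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of juniversalchardet.
final class GuessedCharsetDecoding {
    private GuessedCharsetDecoding() {}

    /**
     * Decodes bytes with the named charset. Falls back to UTF-8 when the
     * name is null, syntactically illegal, or not supported by this JVM.
     */
    static String decode(byte[] bytes, String charsetName) {
        Charset charset = StandardCharsets.UTF_8; // assumed fallback
        try {
            if (charsetName != null && Charset.isSupported(charsetName)) {
                charset = Charset.forName(charsetName);
            }
        } catch (IllegalArgumentException e) {
            // Illegal charset name: keep the UTF-8 fallback.
        }
        // Malformed input is replaced with U+FFFD rather than rejected.
        return new String(bytes, charset);
    }
}
```

With the method from the answer, a call would look like GuessedCharsetDecoding.decode(bytes, guessEncoding(bytes)).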

I've tested both juniversalchardet and jchardet. My general impression is that juniversalchardet provides the better detection accuracy and the nicer API of the two libraries.
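
For the narrow case mentioned in the question, where the input is expected to be either UTF-8 or ISO-8859-1, a dependency-free heuristic can be sketched with the JDK alone. This is a different and much cruder technique than the statistical detection juniversalchardet performs: it checks whether the bytes are well-formed UTF-8 and otherwise assumes ISO-8859-1. The class and method names are hypothetical. Note that any byte sequence is valid ISO-8859-1, so the fallback branch is a guess, not a detection.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 validation with an ISO-8859-1 fallback.
final class Utf8OrLatin1Guesser {
    private Utf8OrLatin1Guesser() {}

    /** Returns "UTF-8" if the bytes are well-formed UTF-8, otherwise "ISO-8859-1". */
    static String guessUtf8OrLatin1(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return "UTF-8"; // also covers pure ASCII and empty input
        } catch (CharacterCodingException e) {
            return "ISO-8859-1"; // assumed fallback, cannot be verified
        }
    }
}
```

Non-ASCII ISO-8859-1 text rarely happens to form valid UTF-8 multi-byte sequences, which is why this check tends to work for this specific pair of encodings. It says nothing useful about other encodings.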
