Guessing the encoding of text represented as byte[] in Java


Problem description

Given an array of bytes representing text in some unknown encoding (usually UTF-8 or ISO-8859-1, but not necessarily so), what is the best way to obtain a guess for the most likely encoding used (in Java)?

Worth noting:

  • No additional meta-data is available. The byte array is literally the only available input.
  • The detection algorithm will obviously not be 100% correct. If it is correct in more than, say, 80% of cases, that is good enough.

Solution

The following method solves the problem using juniversalchardet, which is a Java port of Mozilla's encoding detection library.

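import org.mozilla.universalchardet.UniversalDetector;

public static String guessEncoding(byte[] bytes) {
    // Fallback used when the detector cannot identify a charset.
    final String defaultEncoding = "UTF-8";
    // Passing null means no listener callback is registered.
    UniversalDetector detector = new UniversalDetector(null);
    // Feed the whole array, then signal end of input so a result is finalized.
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    // getDetectedCharset() returns null if no encoding could be determined.
    String encoding = detector.getDetectedCharset();
    detector.reset();
    return encoding != null ? encoding : defaultEncoding;
}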

The code above has been tested and works as intended. Simply add juniversalchardet-1.0.3.jar to the classpath.
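
Once an encoding name has been guessed, the bytes still need to be decoded with it. The following is a minimal sketch of that step using only the standard library. The class name DetectedCharsetDecoder and the method decode are hypothetical names chosen for illustration, and the UTF-8 fallback is an assumption that mirrors the default used in guessEncoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of juniversalchardet.
public class DetectedCharsetDecoder {

    /**
     * Decodes bytes using a charset name reported by a detector.
     * Falls back to UTF-8 when the name is null, malformed,
     * or not supported by the running JVM (assumed policy).
     */
    public static String decode(byte[] bytes, String detectedName) {
        Charset charset = StandardCharsets.UTF_8;
        try {
            if (detectedName != null && Charset.isSupported(detectedName)) {
                charset = Charset.forName(detectedName);
            }
        } catch (IllegalArgumentException e) {
            // Syntactically illegal charset name: keep the UTF-8 fallback.
        }
        return new String(bytes, charset);
    }
}
```

With the method from the answer, a call could look like DetectedCharsetDecoder.decode(bytes, guessEncoding(bytes)). Checking Charset.isSupported first matters because a detector may report a name that the local JVM does not ship.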

I've tested both juniversalchardet and jchardet. My general impression is that, of the two libraries, juniversalchardet provides better detection accuracy and a nicer API.
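
For the narrow case the question calls typical (input that is either UTF-8 or ISO-8859-1), a dependency-free heuristic may be enough. The sketch below is an alternative technique, not what juniversalchardet does: it checks whether the bytes are well-formed UTF-8 with a strict CharsetDecoder and otherwise assumes ISO-8859-1. The class name Utf8OrLatin1 and the method guess are hypothetical. Pure ASCII input is reported as UTF-8, and any other encoding will be misclassified as one of the two.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical two-way heuristic: strict UTF-8 validation with a Latin-1 fallback.
public class Utf8OrLatin1 {

    /** Returns "UTF-8" if the bytes are well-formed UTF-8, otherwise "ISO-8859-1". */
    public static String guess(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            // Every byte sequence is valid ISO-8859-1, so it serves as the fallback.
            return "ISO-8859-1";
        }
    }
}
```

This works reasonably well because non-ASCII ISO-8859-1 text rarely happens to form valid UTF-8 multi-byte sequences, but it is only a guess in the same sense as the library-based approach.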
