How can I identify different encodings for files without the use of a BOM when they begin with a non-ASCII character?


Problem description

I have a problem identifying the encoding of a file without a BOM, particularly when the file begins with non-ASCII characters.

I found the following two topics about how to identify encodings for files:

Java: Readers and Encodings

Currently, I have created a class to identify different encodings for files (e.g. UTF-8, UTF-16, UTF-32, UTF-16 without a BOM, etc.), as follows:

public class UnicodeReader extends Reader {
private static final int BOM_SIZE = 4;
private final InputStreamReader reader;

/**
 * Construct UnicodeReader
 * @param in Input stream.
 * @param defaultEncoding Default encoding to be used if BOM is not found,
 * or <code>null</code> to use system default encoding.
 * @throws IOException If an I/O error occurs.
 */
public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
    byte[] bom = new byte[BOM_SIZE];
    String encoding;
    int unread;
    PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
    int n = pushbackStream.read(bom, 0, bom.length);

    // Read ahead four bytes and check for BOM marks.
    // Note: the 4-byte UTF-32 BOMs must be checked before UTF-16LE,
    // because FF FE is a prefix of the UTF-32LE BOM (FF FE 00 00).
    if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
        encoding = "UTF-8";
        unread = n - 3;
    } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
        encoding = "UTF-32BE";
        unread = n - 4;
    } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
        encoding = "UTF-32LE";
        unread = n - 4;
    } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
        encoding = "UTF-16BE";
        unread = n - 2;
    } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
        encoding = "UTF-16LE";
        unread = n - 2;
    } else {
        // No BOM detected, but the file could still be UTF-16.
        int found = 0;
        for (int i = 0; i < n; i++) {   // only inspect bytes actually read
            if (bom[i] == (byte) 0x00)
                found++;
        }

        if(found >= 2) {
            if(bom[0] == (byte) 0x00){
                encoding = "UTF-16BE";
            }
            else {
                encoding = "UTF-16LE";
            }
            unread = n;
        }
        else {
            encoding = defaultEncoding;
            unread = n;
        }
    }

    // Unread the bytes that are not part of a BOM so the reader sees them.
    if (unread > 0) {
        pushbackStream.unread(bom, (n - unread), unread);
    }

    // Use given encoding.
    if (encoding == null) {
        reader = new InputStreamReader(pushbackStream);
    } else {
        reader = new InputStreamReader(pushbackStream, encoding);
    }
}

public String getEncoding() {
    return reader.getEncoding();
}

public int read(char[] cbuf, int off, int len) throws IOException {
    return reader.read(cbuf, off, len);
}

public void close() throws IOException {
    reader.close();
}

}

The above code works properly in all cases except when the file has no BOM and begins with non-ASCII characters. Under those circumstances the logic that checks whether the file might still be UTF-16 without a BOM does not work correctly, and the encoding falls back to the default (UTF-8).
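One way to make the no-BOM UTF-16 check more robust is to look at *where* the zero bytes fall rather than just how many there are: for text dominated by characters below U+0100, UTF-16BE puts 0x00 at even offsets and UTF-16LE at odd offsets, even when the first character is non-ASCII. A minimal sketch of that idea (the class name and thresholds are my own assumptions, not part of the original code):

```java
// Hypothetical helper: guess whether a BOM-less byte sample is UTF-16
// by checking the parity of the zero-byte positions. Plain counting
// (as in the question's code) cannot tell BE from LE once the file
// starts with a non-ASCII character.
class Utf16Sniffer {

    /** Returns "UTF-16BE", "UTF-16LE", or null if the sample looks like neither. */
    static String sniff(byte[] sample) {
        int zerosEven = 0, zerosOdd = 0;
        for (int i = 0; i < sample.length; i++) {
            if (sample[i] == 0x00) {
                if (i % 2 == 0) zerosEven++; else zerosOdd++;
            }
        }
        int pairs = sample.length / 2;   // number of 16-bit code units
        if (pairs == 0) return null;
        // Require a clear majority of code units to have a zero high byte,
        // and a strong skew toward one parity.
        if (zerosEven > pairs / 2 && zerosEven > 2 * zerosOdd) return "UTF-16BE";
        if (zerosOdd > pairs / 2 && zerosOdd > 2 * zerosEven) return "UTF-16LE";
        return null;
    }
}
```

This is still only a heuristic: it works for Latin-heavy text, but a UTF-16 file of, say, pure CJK characters has almost no zero bytes at all, so a larger sample and additional statistics would be needed in general.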

Is there a way to check the encoding of a file that has no BOM and begins with non-ASCII characters, especially for UTF-16 files with no BOM?

Thanks; any ideas would be appreciated.

Answer

The best approach is not to try to implement this yourself. Instead, use an existing library to do it; see Java: How to determine the correct charset encoding of a stream. For instance:

  • http://code.google.com/p/juniversalchardet/
  • http://jchardet.sourceforge.net/
  • http://site.icu-project.org/
  • http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
  • http://docs.codehaus.org/display/GUESSENC/Home

It should be noted that the best that can be done is to guess at the most likely encoding for the file. In the general case, it is impossible to be 100% sure that you have figured out the correct encoding, i.e. the encoding that was used when the file was created.
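Since any answer is ultimately a guess, one cheap signal available in the JDK itself (no third-party library) is a strict trial decode: a charset whose decoder rejects the bytes can be eliminated outright. The converse does not hold; ISO-8859-1, for example, accepts every byte sequence, so a clean decode is only weak evidence. A sketch, with an illustrative class name of my own:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Strict trial decode: report (rather than silently replace) any byte
// sequence that is invalid in the candidate charset.
class CharsetProbe {

    /** Returns true if the bytes form a valid sequence in the given charset. */
    static boolean decodesCleanly(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Running each candidate encoding through such a probe narrows the field; ranking the survivors (by byte-frequency statistics, language models, etc.) is exactly what the libraries listed above do.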


I would say these third-party libraries also cannot identify the encodings of the file I encountered [...] they could be improved to meet my requirement.

Alternatively, you could recognize that your requirement is exceedingly hard to meet ... and change it; e.g.


  • limit yourself to a certain set of encodings,

  • insist that whoever provides/uploads a file correctly state what its encoding (or primary language) is, and/or

  • accept that your system is going to get it wrong a certain percentage of the time, and provide the means to correct that.

Face the facts: this is a THEORETICALLY UNSOLVABLE problem.
