Java: Readers and Encodings

Question

Java's default encoding is ASCII, right? (See my edit below.)

When a text file is encoded in UTF-8, how does a Reader know that it has to use UTF-8?

The Readers I am talking about are:

  • FileReaders
  • BufferedReaders from Sockets
  • Readers from System.in
  • Scanner
  • ...

Edit: It turns out that the encoding depends on the OS, which means that the following is not true on every OS:

'a' == 97

Answer

How does a Reader know that he has to use UTF-8?

You normally specify that yourself in an InputStreamReader. It has a constructor that takes the character encoding, e.g.:

Reader reader = new InputStreamReader(new FileInputStream("c:/foo.txt"), "UTF-8");

All other readers (as far as I know) use the platform default character encoding, which may well not be the correct encoding (such as -cough- CP-1252).
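As a rough illustration of the point above, the following sketch decodes the same UTF-8 bytes twice: once with the right charset and once with a wrong one. The class and method names (ExplicitCharsetDemo, firstLine, firstLineWithScanner) are made up for this example, and ISO-8859-1 stands in for a mismatched platform default such as CP-1252. A socket or System.in stream could presumably be wrapped the same way, since both expose an InputStream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical demo class, not part of the original answer.
class ExplicitCharsetDemo {

    // Decodes the first line of the stream using the charset the caller names.
    static String firstLine(InputStream in, Charset charset) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Scanner also accepts a charset name; without one it uses the default charset.
    static String firstLineWithScanner(InputStream in, Charset charset) {
        Scanner scanner = new Scanner(in, charset.name());
        return scanner.nextLine();
    }

    public static void main(String[] args) {
        // "h\u00e9llo" is "hello" with an e-acute, written as an escape so the
        // source file's own encoding does not matter.
        byte[] utf8Bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);

        // Correct charset: the e-acute survives.
        System.out.println(firstLine(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8));
        // Wrong charset: the two UTF-8 bytes are decoded as two separate characters.
        System.out.println(firstLine(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1));
        // What a reader without an explicit charset would fall back to on this machine.
        System.out.println("Default charset: " + Charset.defaultCharset());
    }
}
```

Passing the charset explicitly makes the result independent of the machine the code runs on, which is the reason to prefer it over the constructors that rely on the default.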

In theory, you can also detect the character encoding automatically from the byte order mark (BOM), provided the file actually starts with one. This distinguishes the several Unicode encodings from other encodings. Unfortunately, Java SE doesn't have any API for this, but you can homebrew one that can be used in place of InputStreamReader in the example above:

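import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {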
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte bom[] = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Read ahead four bytes and check for BOM marks.
        // The 4-byte UTF-32 marks are checked before the 2-byte UTF-16 marks,
        // because the UTF-32LE mark (FF FE 00 00) also starts with the UTF-16LE mark (FF FE).
        if ((n >= 4) && (bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if ((n >= 4) && (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // No BOM found: fall back to the caller's default and push everything back.
            encoding = defaultEncoding;
            unread = n;
        }

        // Unread bytes if necessary and skip BOM marks.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        } else if (unread < -1) {
            pushbackStream.unread(bom, 0, 0);
        }

        // Use given encoding.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    public void close() throws IOException {
        reader.close();
    }
}
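
To try the BOM checks without opening a stream, the detection step can be pulled out into a small helper. This is only a sketch: BomSniffer, detect and startsWith are names invented here, and the helper deliberately tests the 4-byte UTF-32 marks before the 2-byte UTF-16 marks, since FF FE 00 00 would otherwise be reported as UTF-16LE.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class BomSniffer {

    // Returns the charset name implied by a leading byte order mark,
    // or null if the data does not start with a known BOM.
    static String detect(byte[] head) {
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data, taken as unsigned values, with the prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(withBom));                                  // UTF-8
        System.out.println(detect("hi".getBytes(StandardCharsets.US_ASCII))); // null
    }
}
```

A null result only means no BOM was present. It says nothing about the actual encoding, so the caller still needs a sensible default, as the defaultEncoding parameter of UnicodeReader provides.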

Edit, in reply to your edit:

So the encoding depends on the OS, which means that the following is not true on every OS:

'a' == 97

No, this is not true. The ASCII encoding (which contains 128 characters, 0x00 through 0x7F) is the basis of nearly all other character encodings in common use. Only characters outside the ASCII range risk being interpreted differently in another encoding. The ISO-8859 encodings cover the characters in the ASCII range with the same code points, and Unicode covers the characters in the ISO-8859-1 range with the same code points. Moreover, a Java char is always a UTF-16 code unit, so 'a' == 97 holds in Java code regardless of the platform default encoding.
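
A quick way to check this is to encode the same characters in several charsets and compare the bytes. The sketch below uses a made-up class name (AsciiCompatibilityDemo) and helper (hex), and only charsets that every Java runtime is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class AsciiCompatibilityDemo {

    // Encodes the text in the given charset and renders the bytes as hex, e.g. "C3 A9".
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A char is a UTF-16 code unit, so this comparison never involves the platform encoding.
        System.out.println('a' == 97);                               // true

        // An ASCII character is encoded as the same single byte in all three charsets.
        System.out.println(hex("a", StandardCharsets.US_ASCII));     // 61
        System.out.println(hex("a", StandardCharsets.ISO_8859_1));   // 61
        System.out.println(hex("a", StandardCharsets.UTF_8));        // 61

        // Outside ASCII the byte sequences diverge (U+00E9 is e-acute).
        System.out.println(hex("\u00e9", StandardCharsets.ISO_8859_1)); // E9
        System.out.println(hex("\u00e9", StandardCharsets.UTF_8));      // C3 A9
    }
}
```

The code point of the e-acute is 0xE9 in both ISO-8859-1 and Unicode. What differs is how UTF-8 serializes that code point into bytes.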

You may find each of these blog articles an interesting read:

  1. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (more theoretical of the two)
  2. Unicode - How to get the characters right? (more practical of the two)
