Java: Readers and Encodings

Question

Java's default encoding is ASCII, right? (See my edit below.)

When a text file is encoded in UTF-8, how does a Reader know that it has to use UTF-8?

The Readers I am talking about are:

  • FileReaders
  • BufferedReaders from Sockets
  • A Scanner from System.in
  • ...

EDIT

It turns out the encoding depends on the OS, which means that the following is not true on every OS:

'a' == 97

Solution

How does a Reader know that it has to use UTF-8?

You normally specify that yourself in an InputStreamReader. It has a constructor taking the character encoding. For example:

Reader reader = new InputStreamReader(new FileInputStream("c:/foo.txt"), "UTF-8");

Readers and Scanners created without an explicit charset (as far as I know) fall back to the platform default character encoding, which may well not be the correct encoding for your data (such as, -cough-, CP-1252).
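
As a minimal sketch of the point above, the following wraps a byte stream in an InputStreamReader (and a Scanner) with an explicit charset. The class and method names (ExplicitCharsetDemo, firstLine, firstToken) are made up for illustration, and an in-memory ByteArrayInputStream stands in for a file or socket stream; with a real socket you would pass socket.getInputStream() in the same position.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical demo class, not part of any library.
final class ExplicitCharsetDemo {

    // Reads the first line of a byte stream, decoding it with the given charset.
    static String firstLine(InputStream in, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Same idea for Scanner: pass the charset name instead of relying on the default.
    static String firstToken(InputStream in, Charset charset) {
        try (Scanner scanner = new Scanner(in, charset.name())) {
            return scanner.next();
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "na\u00EFve caf\u00E9".getBytes(StandardCharsets.UTF_8);

        // What a reader without an explicit charset would fall back to.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // Correct charset: the text round-trips.
        System.out.println(firstLine(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8));

        // Wrong charset: each two-byte UTF-8 sequence turns into two garbage characters.
        System.out.println(firstLine(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1));
    }
}
```

Using the StandardCharsets constants rather than a string such as "UTF-8" avoids the checked UnsupportedEncodingException and catches typos at compile time.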

You can in theory also detect the character encoding automatically based on the byte order mark (BOM). This distinguishes several Unicode encodings from other encodings. Java SE unfortunately doesn't have any API for this, but you can homebrew one which can be used in place of InputStreamReader in the example above:

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte[] bom = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        // Note: a single read may return fewer than four bytes; n is -1 on an empty stream.
        int n = pushbackStream.read(bom, 0, bom.length);

        // Check for BOM marks, looking only at bytes that were actually read.
        // UTF-32 comes first because the UTF-32LE BOM (FF FE 00 00) starts with
        // the UTF-16LE BOM (FF FE) and would otherwise never be detected.
        if (n >= 4 && (bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if (n >= 4 && (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if (n >= 3 && (bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if (n >= 2 && (bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if (n >= 2 && (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // No BOM found: keep every byte that was read.
            encoding = defaultEncoding;
            unread = n;
        }

        // Push back the bytes that follow the BOM so the reader sees them again.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        }

        // Use the detected encoding, or the platform default if none was given.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
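
To illustrate why skipping the BOM matters, here is a small sketch showing what a plain InputStreamReader does with BOM-prefixed input. Java's UTF-8 decoder does not strip the BOM, so it shows up as the character U+FEFF at the start of the text, whereas the generic UTF-16 charset consumes the BOM to pick the byte order. The BomDemo class and its decode helper are hypothetical names used only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class BomDemo {

    // Decodes all bytes with the given charset using a plain InputStreamReader.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "hi" preceded by the UTF-8 BOM (EF BB BF).
        byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String text = decode(utf8WithBom, StandardCharsets.UTF_8);
        System.out.println(text.length());              // prints 3: the BOM was kept
        System.out.println(text.charAt(0) == '\uFEFF'); // prints true

        // "h" preceded by the UTF-16 big-endian BOM (FE FF).
        byte[] utf16WithBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 'h'};
        System.out.println(decode(utf16WithBom, StandardCharsets.UTF_16)); // prints h
    }
}
```

A stray U+FEFF at the start of a string is invisible when printed but breaks comparisons and parsing, which is the practical reason to push back only the bytes after the BOM.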

Edit, in reply to your edit:

So the encoding depends on the OS. So that means that this is not true on every OS:

'a' == 97

No, that conclusion is not correct; the comparison holds everywhere. The ASCII encoding (which contains 128 characters, 0x00 up to and including 0x7F) is the basis of practically all other character encodings in common use. Only the characters outside the ASCII charset risk being displayed differently in another encoding. The ISO-8859 encodings cover the characters in the ASCII range with the same codepoints. The Unicode encodings cover the characters in the ISO-8859-1 range with the same codepoints.

You may find each of these blog posts an interesting read:
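
A short sketch of that claim: in Java a char is a UTF-16 code unit, so 'a' == 97 does not depend on the platform default encoding at all. What differs between encodings is the byte representation, and for ASCII-compatible encodings only for characters outside the ASCII range. The AsciiCompatDemo class and its hex helper are hypothetical names for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class AsciiCompatDemo {

    // Formats bytes as space-separated uppercase hex pairs, e.g. "C3 A9".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A char comparison involves no encoding, so this is true on every OS.
        System.out.println('a' == 97); // prints true

        // An ASCII character maps to the same single byte in these encodings.
        System.out.println(hex("a".getBytes(StandardCharsets.US_ASCII)));   // prints 61
        System.out.println(hex("a".getBytes(StandardCharsets.ISO_8859_1))); // prints 61
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_8)));      // prints 61

        // A non-ASCII character (U+00E9) keeps its codepoint but not its bytes.
        System.out.println(hex("\u00E9".getBytes(StandardCharsets.ISO_8859_1))); // prints E9
        System.out.println(hex("\u00E9".getBytes(StandardCharsets.UTF_8)));      // prints C3 A9

        // UTF-16 keeps the codepoint 0x61 but uses two bytes for it.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16BE))); // prints 00 61
    }
}
```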

  1. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (more theoretical of the two)
  2. Unicode - How to get the characters right? (more practical of the two)
