How to deal with java encoding problems (especially xml)?


Problem description


I searched for information about Java and encoding and did not find a resource explaining how to deal with the common problems that arise in Java when encoding and decoding strings. There are a lot of specific questions about single errors, but I did not find a broad answer or reference guide to the problem. The main questions are:

What is String encoding?

Why, in Java, do I read files with wrong characters?

Why, when dealing with xml, do I get an Invalid byte x of y-byte UTF-8 sequence Exception? What are the main causes and how can I avoid them?

Solution

Since Stack Overflow encourages self-answers, I will try to answer my own question.

Encoding is the process of converting data from one format to another. This answer details how String encoding works in Java (you may want to read a more generic introduction to text and encoding first).

Introduction

String encoding/decoding is the process that transforms a String into a byte[] (encoding) and a byte[] into a String (decoding).

At first sight you may think that there are no problems, but if we look more deeply at the process some issues may arise. At the lowest level, information is stored and transmitted in bytes: files are sequences of bytes, and network communication is done by sending and receiving bytes. So every time you read or write a file with plain readable content, or every time you submit a web form or read a web page, there is an underlying encoding operation.

Let's start from the basic operation in Java: creating a String from a sequence of bytes. The following code converts a byte[] (the bytes may come from a file or from a socket) into a String.

    byte[] stringInByte=new byte[]{104,101,108,108,111};
    String simple=new String(stringInByte);
    System.out.println("simple=" + simple);//prints simple=hello

So far so good, all "simple". The values of the bytes are taken from the ASCII table, which shows one way to map letters and numbers to bytes. Let's complicate the sample with a simple requirement: the byte[] contains the € (euro) sign. Oops, there is no euro symbol in the ASCII table.

This can be roughly summarized as the core of the problem: the human-readable characters (together with some other necessary ones such as carriage return, line feed, etc.) number more than 256, i.e. they cannot all be represented with only one byte. If for some reason you must stick with a single-byte representation (e.g. for historical reasons, since the first encoding tables used only 7 bits, or because of space constraints: if disk space is limited and you write text documents only for English readers, there is no need to include Italian accented letters such as è, ì), you have the problem of choosing which characters to represent.

Choosing an encoding is choosing a mapping between bytes and chars.

Coming back to the euro example and sticking with a one byte --> one char mapping, the ISO8859-15 encoding table has the € sign. The sequence of bytes representing the string "hello €" is the following:

    byte[] stringInByte1=new byte[]{104,101,108,108,111,32,(byte)164};

How do you "tell" Java which encoding to use for the conversion? String has the constructor

    String(byte[] bytes, String charsetName)

that allows you to specify "the mapping". If you use different charsets you get different results, as you can see below:

    byte[] stringInByte1=new byte[]{104,101,108,108,111,32,(byte)164};
    String simple1=new String(stringInByte1,"ISO8859-15");
    System.out.println("simple1=" + simple1);  //prints simple1=hello €     

    String simple2=new String(stringInByte1,"ISO8859-1");
    System.out.println("simple2=" + simple2);   //prints simple2=hello ¤

So this explains why you write some characters and read back different ones: the encoding used for writing (String to byte[]) is different from the one used for reading (byte[] to String). The same byte may map to different characters in different encodings, so some characters may "look strange".
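The write/read mismatch described above can be reproduced in a few lines. This is only a sketch: the class name `EncodingMismatchDemo` and the helper `writeThenRead` are made up for illustration, and the euro sign is written as the escape `\u20AC` so that the result does not depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingMismatchDemo {

    // Encodes the text with one charset and decodes the resulting bytes with another.
    static String writeThenRead(String text, Charset writeCharset, Charset readCharset) {
        byte[] bytes = text.getBytes(writeCharset); // String -> byte[] (encoding)
        return new String(bytes, readCharset);      // byte[] -> String (decoding)
    }

    public static void main(String[] args) {
        Charset latin9 = Charset.forName("ISO-8859-15");
        String original = "hello \u20AC"; // \u20AC is the euro sign

        // Same charset on both sides: the text survives the round trip.
        System.out.println(original.equals(writeThenRead(original, latin9, latin9))); // prints true

        // Different charsets: byte 0xA4 is the euro sign in ISO-8859-15
        // but the generic currency sign (U+00A4) in ISO-8859-1.
        String mismatched = writeThenRead(original, latin9, StandardCharsets.ISO_8859_1);
        System.out.println(mismatched.equals("hello \u00A4")); // prints true
    }
}
```

The example prints booleans rather than the strings themselves because the console has its own encoding, which could hide or distort the effect being shown.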
These are the basic concepts needed to understand String encoding; let's complicate the matter a little bit more. There may be a need to represent more than 256 symbols in one text document; multibyte encodings were created to achieve this.

With a multibyte encoding there is no longer a one byte --> one char mapping; there is a sequence of bytes --> one char mapping.

One of the best-known multibyte encodings is UTF-8. UTF-8 is a variable-length encoding: some chars are represented with one byte, others with more than one.

UTF-8 overlaps with some one-byte encodings: its one-byte sequences are exactly the US-ASCII table, so it also agrees with ISO8859-1 on those first 128 characters. It can be viewed as an extension of that one-byte encoding.

Let's see UTF-8 in action on the first example:

    byte[] stringInByte=new byte[]{104,101,108,108,111};
    String simple=new String(stringInByte);
    System.out.println("simple=" + simple);//prints simple=hello

    String simple3=new String(stringInByte, "UTF-8");
    System.out.println("simple3=" + simple3);//this also prints simple3=hello

As you can see by trying the code, it prints hello in both cases, i.e. the bytes that represent hello in UTF-8 and ISO8859-1 are the same (all the characters are in the ASCII range).
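To make the "variable length" part concrete, the following sketch counts the bytes UTF-8 needs for three characters. The class and method names (`Utf8LengthDemo`, `utf8Length`) are illustrative only.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {

    // Number of bytes UTF-8 needs to represent the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));      // prints 1 (ASCII letter)
        System.out.println(utf8Length("\u00E8")); // prints 2 (e with grave accent)
        System.out.println(utf8Length("\u20AC")); // prints 3 (euro sign)
    }
}
```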

But if you try the sample with the € sign you get a ?:

    byte[] stringInByte1=new byte[]{104,101,108,108,111,32,(byte)164};
    String simple1=new String(stringInByte1,"ISO8859-15");
    System.out.println("simple1=" + simple1);//prints simple1=hello €

    String simple4=new String(stringInByte1, "UTF-8");
    System.out.println("simple4=" + simple4);//prints simple4=hello ? (or �, depending on the console)

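The ? means that the char is not recognized and that there is an error: the invalid byte is replaced with the Unicode replacement character (U+FFFD), which many consoles show as ?. Note that you get no exception even though there is an error during the conversion.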
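If you would rather get an exception than a silent replacement, one option is to decode through a `CharsetDecoder` configured to report errors. The sketch below assumes that approach; `StrictDecodeDemo`, `decodeStrict` and `isValid` are hypothetical names chosen for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {

    // Decodes the bytes and throws instead of silently substituting U+FFFD.
    static String decodeStrict(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience wrapper: true if the bytes are well formed in the given charset.
    static boolean isValid(byte[] bytes, Charset charset) {
        try {
            decodeStrict(bytes, charset);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "hello" followed by a space and the euro sign, encoded with ISO-8859-15
        byte[] helloEuroLatin9 = {104, 101, 108, 108, 111, 32, (byte) 164};
        byte[] hello = {104, 101, 108, 108, 111};

        System.out.println(isValid(hello, StandardCharsets.UTF_8));           // prints true
        System.out.println(isValid(helloEuroLatin9, StandardCharsets.UTF_8)); // prints false
    }
}
```

Byte 0xA4 on its own is a continuation byte without a lead byte, which is why the strict decoder rejects it.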

Unfortunately, not all Java classes behave the same way when dealing with invalid chars; let's see what happens when we deal with xml.

Managing XML

Before going through the examples, it is worth remembering that in Java InputStream/OutputStream read/write bytes and Reader/Writer read/write characters.

Let's try to read the sequence of bytes of an xml file in two different ways, i.e. reading the file in order to get a String vs. reading the file in order to get a DOM.

    //Create a xml file
    String xmlSample="<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<specialchars>àèìòù€</specialchars>";
    try(FileOutputStream fosXmlFileOutputStreame= new FileOutputStream("test.xml")) {
        //write the file with a wrong encoding
        fosXmlFileOutputStreame.write(xmlSample.getBytes("ISO8859-15"));
    }

    try (
            FileInputStream xmlFileInputStream= new FileInputStream("test.xml");
            //read the file with the encoding declared in the xml header
            InputStreamReader inputStreamReader= new InputStreamReader(xmlFileInputStream,"UTF-8");
    ) {
        char[] cbuf=new char[xmlSample.length()];
        inputStreamReader.read(cbuf);
        System.out.println("file read with UTF-8=" + new String(cbuf)); 
        //prints
        //file read with UTF-8=<?xml version="1.0" encoding="UTF-8"?>
        //<specialchars>������</specialchars>
    }


    File xmlFile = new File("test.xml");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(xmlFile);     
    //throws
    //com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence

In the first case the result is some strange chars but no exception; in the second case you get an exception (Invalid byte 2 of 3-byte UTF-8 sequence). The exception occurs because the parser reads a byte that announces a three-byte UTF-8 char, and the second byte of that sequence has an invalid value (because of the way UTF-8 encodes chars).

The tricky part is that, since UTF-8 overlaps with some other encodings, the Invalid byte 2 of 3-byte UTF-8 sequence exceptions seem to arise at "random" (i.e. only for messages containing characters outside the ASCII range, the ones UTF-8 represents with more than one byte), so in a production environment the error can be difficult to track and to reproduce.
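The wording of the exception follows from the layout of UTF-8: the first byte of a sequence announces its length, and every following byte must have the bit pattern 10xxxxxx. The sketch below is a deliberately simplified checker (it ignores overlong forms and surrogates, and treats a truncated sequence like an invalid byte); it is not the parser's real code, and `Utf8SequenceCheck` with its methods is invented here only to imitate the message.

```java
class Utf8SequenceCheck {

    // Expected sequence length for a lead byte, or -1 if the byte cannot start a sequence.
    static int sequenceLength(int lead) {
        if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                           // 10xxxxxx or 11111xxx
    }

    // Describes the first structural error, or returns null if none is found.
    static String firstError(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int length = sequenceLength(bytes[i] & 0xFF);
            if (length == -1) {
                return "Invalid byte 1 of 1-byte UTF-8 sequence";
            }
            for (int k = 1; k < length; k++) {
                // Every byte after the lead byte must look like 10xxxxxx.
                if (i + k >= bytes.length || (bytes[i + k] & 0xC0) != 0x80) {
                    return "Invalid byte " + (k + 1) + " of " + length + "-byte UTF-8 sequence";
                }
            }
            i += length;
        }
        return null;
    }

    public static void main(String[] args) {
        // a-grave and e-grave as written by ISO-8859-15: 0xE0 0xE8
        byte[] latin9 = {(byte) 0xE0, (byte) 0xE8};
        System.out.println(firstError(latin9)); // prints Invalid byte 2 of 3-byte UTF-8 sequence
    }
}
```

0xE0 has the pattern 1110xxxx, so it promises a three-byte sequence, but 0xE8 (11101000) is not a continuation byte; that is the situation in the test.xml example.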

With all this information we can try to answer the following question:

Why do I get an Invalid byte x of y-byte UTF-8 sequence Exception when reading/dealing with an xml file?

Because there is a mismatch between the encoding used for writing (ISO8859-15 in the test case above) and the encoding used for reading (UTF-8 in the test case above). The mismatch may have several different causes:

  1. You are making a wrong conversion between bytes and chars: for example, if you are reading a file with an InputStream, converting it into a Reader and passing the Reader to the xml library, you must specify the charset name as in the following code (i.e. you must know the encoding used for saving the file):

    try (
        FileInputStream xmlFileInputStream= new FileInputStream("test.xml");
        //this is the reader for the xml library (DOM4J, JDOM for example)
        //UTF-8 is the file encoding; if you specify a wrong encoding or you do not specify
        //any encoding you may face the Invalid byte x of y-byte UTF-8 sequence Exception
        InputStreamReader inputStreamReader= new InputStreamReader(xmlFileInputStream,"UTF-8");
    ) {
        //pass inputStreamReader to the xml library here
    }

  2. You are passing the InputStream directly to the xml library but the file is not correct (as in the first example of managing xml, where the header states UTF-8 but the real encoding is ISO8859-15). Simply putting <?xml version="1.0" encoding="UTF-8"?> in the first line of the file is not enough; the file must be saved with the encoding declared in the header.

  3. You are reading the file with a reader created without specifying an encoding, and the platform encoding is different from the file encoding:

    FileReader fileReader=new FileReader("text.xml");
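For the second cause in the list above, one way to keep the header and the real encoding from drifting apart is to derive both from the same Charset object. The following is a minimal sketch of that idea, not a general xml writer: `MatchingHeaderDemo`, `toXmlBytes` and `rootText` are hypothetical names, the document lives in memory instead of in test.xml, and the text is assumed to contain no characters that need xml escaping.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class MatchingHeaderDemo {

    // Builds the document bytes so that the header and the real encoding always agree.
    static byte[] toXmlBytes(String text, Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>\n"
                + "<specialchars>" + text + "</specialchars>";
        return xml.getBytes(charset);
    }

    // Parses the bytes the way a library does when it is handed an InputStream.
    static String rootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse xml", e);
        }
    }

    public static void main(String[] args) {
        String text = "\u00E0\u00E8\u00EC\u00F2\u00F9"; // accented vowels
        // Whatever charset is chosen, the parser gets a header that tells the truth.
        System.out.println(text.equals(rootText(toXmlBytes(text, StandardCharsets.UTF_8))));      // prints true
        System.out.println(text.equals(rootText(toXmlBytes(text, StandardCharsets.ISO_8859_1)))); // prints true
    }
}
```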
    

This leads to one aspect that, at least for me, is the source of most String encoding problems in Java: using the default platform encoding.

When you call

    "Hello €".getBytes();

you can get different results on different operating systems; this is because on Windows the default encoding may be Windows-1252 while on Linux it may be UTF-8 (since Java 18 the default charset is UTF-8 on every platform unless it is overridden). The € char is encoded differently, so you get not only different bytes but also different array sizes:

    String helloEuro="hello €";
    //prints hello euro byte[] size in iso8859-15 = 7
    System.out.println("hello euro byte[] size in iso8859-15 = " + helloEuro.getBytes("ISO8859-15").length);
    //prints hello euro byte[] size in utf-8 = 9
    System.out.println("hello euro byte[] size in utf-8 = " + helloEuro.getBytes("UTF-8").length);

Checking for String.getBytes() or new String(byte[] ...) used without specifying an encoding is the first thing to do when you run into encoding issues.
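When doing that first check, it can help to print what the platform default actually is on the machine where the problem shows up. A small sketch (the class name `DefaultCharsetCheck` and the helper `utf8Bytes` are only illustrative):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetCheck {

    // Explicit charset: the result is the same on every platform.
    static byte[] utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String helloEuro = "hello \u20AC";

        // Both of these lines depend on the platform and JVM settings.
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("size with default charset = " + helloEuro.getBytes().length);

        // This one does not.
        System.out.println("size with UTF-8 = " + utf8Bytes(helloEuro).length); // prints size with UTF-8 = 9
    }
}
```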

The second one is checking if you are reading or writing files using FileReader or FileWriter; in both cases the documentation states:

The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable

As with String.getBytes(), reading/writing the same file on different platforms with a reader/writer and without specifying the charset may lead to different byte sequences, due to the different default platform encodings.

The solution, as the javadoc suggests, is to use an InputStreamReader/OutputStreamWriter that wraps an InputStream/OutputStream together with a charset specification.
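A minimal sketch of that wrapping, assuming UTF-8 is the encoding agreed on for the file; the class `ExplicitCharsetFileDemo` and its methods are hypothetical, and the round trip goes through a temporary file only to keep the example self-contained.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetFileDemo {

    // Writes text with an explicit charset instead of relying on FileWriter's default.
    static void write(File file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file), StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    // Reads text with the same explicit charset instead of relying on FileReader's default.
    static String read(File file) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8)) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    // Writes the text to a temporary file and reads it back.
    static String roundTrip(String text) {
        try {
            File file = File.createTempFile("encoding-demo", ".txt");
            try {
                write(file, text);
                return read(file);
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String helloEuro = "hello \u20AC";
        System.out.println(helloEuro.equals(roundTrip(helloEuro))); // prints true
    }
}
```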

Some final notes on how some xml libraries read XML content:

  1. If you pass a Reader, the library relies on the reader for the encoding (i.e. it does not check what the xml header says) and does not do anything about encoding, since it is reading chars, not bytes.

  2. If you pass an InputStream or a File, the library relies on the xml header for the encoding and it may throw some encoding Exceptions.
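The two notes above can be observed with the JDK's built-in DOM parser, using the same wrongly declared document as in the test.xml example (header says UTF-8, bytes are ISO-8859-15). This is a sketch of the behaviour I would expect from that parser; other libraries may differ, and the names `ReaderVersusStreamDemo`, `parseViaReader` and `parsesViaStream` are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ReaderVersusStreamDemo {

    static final Charset LATIN9 = Charset.forName("ISO-8859-15");
    static final String BODY = "\u00E0\u00E8\u00EC\u00F2\u00F9\u20AC"; // accented vowels and the euro sign

    // Header declares UTF-8 but the bytes are ISO-8859-15, as in the test.xml example.
    static byte[] misdeclaredXml() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<specialchars>" + BODY + "</specialchars>";
        return xml.getBytes(LATIN9);
    }

    static DocumentBuilder newBuilder() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reader route: we decode the bytes ourselves, so the parser only ever sees chars.
    static String parseViaReader(byte[] xmlBytes, Charset realCharset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), realCharset)) {
            Document doc = newBuilder().parse(new InputSource(reader));
            return doc.getDocumentElement().getTextContent();
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("could not parse xml", e);
        }
    }

    // Stream route: the parser trusts the header; returns false if parsing fails.
    static boolean parsesViaStream(byte[] xmlBytes) {
        try {
            newBuilder().parse(new ByteArrayInputStream(xmlBytes));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] xmlBytes = misdeclaredXml();
        System.out.println(BODY.equals(parseViaReader(xmlBytes, LATIN9))); // prints true
        System.out.println(parsesViaStream(xmlBytes));                     // prints false
    }
}
```

The Reader route only works here because the caller happens to know the real encoding; it does not repair the file, it just bypasses the misleading header.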

Database

A different issue may arise when dealing with databases. When a database is created it has an encoding property used to save the varchar and string columns (such as clob). If the database is created with an 8-bit encoding (ISO8859-15 for example), problems may arise when you try to insert chars not allowed by the encoding. What is saved in the db may be different from the string specified at Java level, because in Java strings are represented in memory in UTF-16, which is "wider" than the encoding specified at the database level. The simplest solution is: create your database with a UTF-8 encoding.
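If you are stuck with an 8-bit database encoding, a possible client-side guard is to check whether a string fits that charset before inserting it. This is only a sketch of the idea: it assumes you know the database charset (ISO-8859-15 here), it does not talk to any database, and `DbCharsetPreCheck` and `fits` are hypothetical names.

```java
import java.nio.charset.Charset;

class DbCharsetPreCheck {

    // True if every character of the text can be represented in the given charset.
    static boolean fits(String text, Charset databaseCharset) {
        return databaseCharset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        Charset latin9 = Charset.forName("ISO-8859-15"); // assumed database encoding

        System.out.println(fits("hello \u20AC", latin9)); // prints true (the euro sign exists in ISO-8859-15)
        System.out.println(fits("\u4F60\u597D", latin9)); // prints false (two Chinese characters)

        // Without the check, getBytes quietly replaces what it cannot encode with '?'.
        String stored = new String("\u4F60\u597D".getBytes(latin9), latin9);
        System.out.println(stored); // prints ??
    }
}
```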

Web

This is a very good starting point.

If you feel something is missing, feel free to ask for more in the comments.
