How to deal with java encoding problems (especially xml)?


Problem description

I searched for material about Java and encoding, but I did not find a resource explaining how to deal with the common problems that arise in Java when encoding and decoding strings. There are a lot of specific questions about single errors, but I did not find a broad answer or reference guide to the problem. The main questions are:

What is String encoding?

Why in Java can I read files with wrong characters?

Why, when dealing with xml, do I get an Invalid byte x of y-byte UTF-8 sequence Exception? What are the main causes and how can I avoid them?

Solution

Since Stackoverflow encourages self answers, I will try to answer this myself.

Encoding is the process of converting data from one format to another. This response details how String encoding works in Java (you may want to read this for a more generic introduction to text and encoding).

Introduction

String encoding/decoding is the process that transforms a byte[] into a String and vice-versa.

At first sight you may think that there are no problems, but if we look more deeply at the process some issues may arise. At the lowest level, information is stored and transmitted in bytes: a file is a sequence of bytes, and network communication is done by sending and receiving bytes. So every time you read or write a file with plain readable content, or every time you submit a web form or read a web page, there is an underlying encoding operation. Let's start from the basic String encoding operation in Java: creating a String from a sequence of bytes. The following code converts a byte[] (the bytes may come from a file or from a socket) into a String.

    byte[] stringInByte=new byte[]{104,101,108,108,111};
    String simple=new String(stringInByte);
    System.out.println("simple=" + simple);//prints simple=hello

So far so good, all "simple". The values of the bytes are taken from the ASCII table, which shows one way to map letters and numbers to bytes. Let's complicate the sample with a simple requirement: the byte[] contains the € (euro) sign. Oops, there is no euro symbol in the ASCII table.

This can be roughly summarized as the core of the problem: the human readable characters (together with some other necessary ones such as carriage return, line feed, etc.) are more than 256, i.e. they cannot all be represented with only one byte. If for some reason you must stick with a single-byte representation (e.g. for historical reasons, since the first encoding tables used only 7 bits, or for space constraints: if the space on the disk is limited and you write text documents only for English readers, there is no need to include Italian letters with an accent such as è, ì) you have the problem of choosing which characters to represent.

Choosing an encoding is choosing a mapping between bytes and chars.
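As a rough illustration of that mapping, here is a minimal sketch that encodes the same euro character with three charsets and shows the resulting bytes in hex. The class and method names (`EuroMapping`, `hexOf`) are made up for this example, and the euro sign is written as the escape `\u20AC` so the source does not depend on the encoding of the .java file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class EuroMapping {

    // Encodes the text with the given charset and returns the bytes as hex, e.g. "E2 82 AC".
    static String hexOf(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String euro = "\u20AC"; // the euro sign
        // One byte in ISO-8859-15
        System.out.println("ISO-8859-15: " + hexOf(euro, Charset.forName("ISO-8859-15")));
        // Three bytes in UTF-8
        System.out.println("UTF-8:       " + hexOf(euro, StandardCharsets.UTF_8));
        // ISO-8859-1 has no euro sign, so getBytes substitutes the byte for '?'
        System.out.println("ISO-8859-1:  " + hexOf(euro, StandardCharsets.ISO_8859_1));
    }
}
```

The same character becomes `A4`, `E2 82 AC` or a substituted `3F` depending only on the charset that is chosen.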

Coming back to the euro example and sticking with a one byte --> one char mapping, the ISO8859-15 encoding table has the € sign. The sequence of bytes representing the string "hello €" is the following one:

byte[] stringInByte1=new byte[]{104,101,108,108,111,32,(byte)164};

How do you "tell" Java which encoding to use for the conversion? String has the constructor

String(byte[] bytes, String charsetName)

That allows you to specify "the mapping". If you use different charsets you get different output, as you can see below:

    byte[] stringInByte1=new byte[]{104,101,108,108,111,32,(byte)164};
    String simple1=new String(stringInByte1,"ISO8859-15");
    System.out.println("simple1=" + simple1);  //prints simple1=hello €     

    String simple2=new String(stringInByte1,"ISO8859-1");
    System.out.println("simple2=" + simple2);   //prints simple2=hello ¤

So this explains why you write some characters and read different ones: the encoding used for writing (String to byte[]) is different from the one used for reading (byte[] to String). The same byte may map to different characters in different encodings, so some characters may "look strange".
These are the basic concepts needed to understand String encoding; let's complicate the matter a little bit more. There may be the need to represent more than 256 symbols in one text document; multibyte encodings have been created to achieve this.
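The write/read mismatch described above can be sketched in a few lines. This is only an illustration under assumed names (`MismatchDemo`, `writeThenRead`): it encodes a string with one charset and decodes the bytes with another, which is what happens when the writer and the reader of a file disagree.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class MismatchDemo {

    // Encodes with one charset and decodes with another.
    static String writeThenRead(String text, Charset writeCharset, Charset readCharset) {
        byte[] bytes = text.getBytes(writeCharset);
        return new String(bytes, readCharset);
    }

    public static void main(String[] args) {
        String original = "\u00E8"; // e with grave accent

        // Same charset on both sides: the text survives the round trip.
        String same = writeThenRead(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + same.equals(original));

        // Written as UTF-8 (two bytes, C3 A8) but read as ISO-8859-1:
        // each byte becomes its own character, U+00C3 followed by U+00A8.
        String garbled = writeThenRead(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("garbled length: " + garbled.length());
    }
}
```

No exception is thrown in either case; the only symptom of the mismatch is that one character has turned into two "strange" ones.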

With a multibyte encoding there is no longer a one byte --> one char mapping; instead there is a sequence of bytes --> one char mapping.

One of the best known multibyte encodings is UTF-8. UTF-8 is a variable length encoding: some chars are represented with one byte, others with more than one.

UTF-8 overlaps with some one-byte encodings: it is identical to US-ASCII for the first 128 values, and ISO8859-1 shares that same range. It can be viewed as an extension of a one-byte encoding.
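To make the variable length concrete, the following sketch (class and method names are assumptions for illustration) counts the UTF-8 bytes of a few characters and checks whether a string has the same bytes in UTF-8 and in ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the original answer.
class Utf8Lengths {

    // Number of bytes the text needs when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // True when the text encodes to identical bytes in UTF-8 and ISO-8859-1.
    static boolean sameBytesInLatin1(String text) {
        return Arrays.equals(
                text.getBytes(StandardCharsets.UTF_8),
                text.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        System.out.println("h      -> " + utf8Length("h"));      // 1 byte
        System.out.println("U+00E8 -> " + utf8Length("\u00E8")); // 2 bytes (e with grave accent)
        System.out.println("U+20AC -> " + utf8Length("\u20AC")); // 3 bytes (euro sign)

        // Plain ASCII text is byte-for-byte identical in both encodings,
        // accented letters are not.
        System.out.println(sameBytesInLatin1("hello"));  // true
        System.out.println(sameBytesInLatin1("\u00E8")); // false
    }
}
```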

Let's see UTF-8 in action for the first example

    byte[] stringInByte=new byte[]{104,101,108,108,111};
    String simple=new String(stringInByte);
    System.out.println("simple=" + simple);//prints simple=hello

    String simple3=new String(stringInByte, "UTF-8");
    System.out.println("simple3=" + simple3);//prints simple3=hello

As you can see by trying the code, it prints hello in both cases, i.e. the bytes that represent hello in UTF-8 and ISO8859-1 are the same.

But if you try the sample with the € sign you get a ?

    byte[] stringInByte1=new byte[]{104,101,108,108,111,32,(byte)164};
    String simple1=new String(stringInByte1,"ISO8859-15");
    System.out.println("simple1=" + simple1);//prints simple1=hello €

    String simple4=new String(stringInByte1, "UTF-8");
    System.out.println("simple4=" + simple4);//prints simple4=hello ?

meaning that the char is not recognized and that there is an error (Java substitutes the Unicode replacement character U+FFFD, which many consoles display as ?). Note that you get no exception even if there is an error during the conversion.
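If you would rather get an exception than a silent substitution, one option is to decode through a `CharsetDecoder` configured to report errors. The sketch below is a minimal example under assumed names (`StrictUtf8`, `decodeStrict`, `isValidUtf8`); it is not something the String constructor does for you.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class StrictUtf8 {

    // Decodes the bytes as UTF-8 and throws instead of substituting U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Convenience wrapper: true when the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] hello = {104, 101, 108, 108, 111};
        byte[] helloEuroLatin9 = {104, 101, 108, 108, 111, 32, (byte) 164};

        System.out.println(isValidUtf8(hello));           // true
        System.out.println(isValidUtf8(helloEuroLatin9)); // false: byte 164 alone is not valid UTF-8

        // The lenient String constructor replaces the bad byte with U+FFFD instead.
        String lenient = new String(helloEuroLatin9, StandardCharsets.UTF_8);
        System.out.println(lenient.endsWith("\uFFFD"));   // true
    }
}
```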

Unfortunately not all Java classes behave the same way when dealing with invalid chars; let's see what happens when we deal with xml.

Managing XML

Before going through the examples it is worth remembering that in Java InputStream/OutputStream read/write bytes and Reader/Writer read/write characters.

Let's try to read the sequence of bytes of an xml file in two different ways, i.e. reading the file in order to get a String vs reading the file in order to get a DOM.

    //imports needed: java.io.File, java.io.FileInputStream, java.io.FileOutputStream,
    //java.io.InputStreamReader, javax.xml.parsers.DocumentBuilder,
    //javax.xml.parsers.DocumentBuilderFactory, org.w3c.dom.Document

    //Create an xml file
    String xmlSample="<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<specialchars>àèìòù€</specialchars>";
    try (FileOutputStream xmlFileOutputStream= new FileOutputStream("test.xml")) {
        //write the file with a wrong encoding
        xmlFileOutputStream.write(xmlSample.getBytes("ISO8859-15"));
    }

    try (
            FileInputStream xmlFileInputStream= new FileInputStream("test.xml");
            //read the file with the encoding declared in the xml header
            InputStreamReader inputStreamReader= new InputStreamReader(xmlFileInputStream,"UTF-8");
    ) {
        char[] cbuf=new char[xmlSample.length()];
        //read returns the number of chars actually read
        int charsRead=inputStreamReader.read(cbuf);
        System.out.println("file read with UTF-8=" + new String(cbuf, 0, charsRead));
        //prints
        //file read with UTF-8=<?xml version="1.0" encoding="UTF-8"?>
        //<specialchars>������</specialchars>
    }


    File xmlFile = new File("test.xml");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(xmlFile);
    //throws

com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 2 of 3-byte UTF-8 sequence

In the first case the result is some strange chars but no Exception; in the second case you get an exception (Invalid sequence....). The exception occurs because the parser is reading what looks like a three-byte char of a UTF-8 sequence and the second byte has an invalid value (because of the way UTF-8 encodes chars).

The tricky part is that, since UTF-8 overlaps with some other encodings, the Invalid byte 2 of 3-byte UTF-8 sequence exceptions appear to arise at "random" (i.e. only for the messages with characters represented by more than one byte), so in a production environment the error can be difficult to track and to reproduce.

With all this information we can try to answer the following question:

Why do I get an Invalid byte x of y-byte UTF-8 sequence Exception when reading/dealing with an xml file?

Because there is a mismatch between the encoding used for writing (ISO8859-15 in the test case above) and the encoding used for reading (UTF-8 in the test case above); the mismatch may have several different causes:

  1. you are making a wrong conversion between bytes and chars: for example, if you are reading a file with an InputStream, converting it into a Reader and passing the Reader to the xml library, you must specify the charset name as in the following code (i.e. you must know the encoding used for saving the file)

    try (
            FileInputStream xmlFileInputStream= new FileInputStream("test.xml");
            //this is the reader for the xml library (DOM4J, JDOM for example)
            //UTF-8 is the file encoding; if you specify a wrong encoding or you do not specify
            //any encoding you may face an Invalid byte x of y-byte UTF-8 sequence Exception
            InputStreamReader inputStreamReader= new InputStreamReader(xmlFileInputStream,"UTF-8");
    ) {
        //pass inputStreamReader to the xml library here
    }

  2. you are passing the InputStream directly to the xml library but the file is not correct (as in the first example of managing xml, where the header states UTF-8 but the real encoding is ISO8859-15). Simply putting the encoding declaration in the first line of the file is not enough; the file must be saved with the encoding used in the header.

  3. you are reading the file with a reader created without specifying an encoding and the platform encoding is different from file encoding:

    FileReader fileReader=new FileReader("test.xml");
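For cause 2 in particular, the fix is to produce the bytes with the same charset that the header declares. Here is a minimal sketch under assumed names (`MatchingEncoding`, `xmlBytes`, `parseText`); it keeps everything in memory instead of using a file, and the content is assumed not to need XML escaping.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper class, not part of the original answer.
class MatchingEncoding {

    // Builds the document bytes with the SAME charset that the header declares.
    // The content is inserted as-is, so it must not contain '<' or '&'.
    static byte[] xmlBytes(String content, Charset charset) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>\n"
                + "<specialchars>" + content + "</specialchars>";
        return xml.getBytes(charset);
    }

    // Lets the parser decode the bytes itself, based on the xml header.
    static String parseText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse xml", e);
        }
    }

    public static void main(String[] args) {
        String accents = "\u00E0\u00E8"; // a and e with grave accent
        // Header and bytes agree in both cases, so both parse to the same text.
        System.out.println(parseText(xmlBytes(accents, StandardCharsets.UTF_8)).equals(accents));
        System.out.println(parseText(xmlBytes(accents, StandardCharsets.ISO_8859_1)).equals(accents));
    }
}
```

The point of the design is that the charset appears once, as a parameter, and drives both the header and the `getBytes` call, so the two cannot drift apart.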
    

This leads to one aspect that, at least for me, is the source of most of the String encoding problems in Java: using the default platform encoding.

When you call

"Hello €".getBytes();

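you can get different results on different operating systems; this is because on Windows the default encoding is typically Windows-1252 while on Linux it may be UTF-8 (since Java 18 the default charset is UTF-8 everywhere unless overridden, but older runtimes follow the platform setting). The € char is encoded differently, so you get not only different bytes but also different array sizes: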

    String helloEuro="hello €";
    //prints hello euro byte[] size in iso8859-15 = 7
    System.out.println("hello euro byte[] size in iso8859-15 = " + helloEuro.getBytes("ISO8859-15").length);
    //prints hello euro byte[] size in utf-8 = 9
    System.out.println("hello euro byte[] size in utf-8 = " + helloEuro.getBytes("UTF-8").length);

Checking for uses of String.getBytes() or new String(byte[] ...) without a specified encoding is the first thing to do when you run into encoding issues.
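A quick way to see what the no-argument calls are using on a given machine is to print `Charset.defaultCharset()`. The sketch below (names such as `DefaultCharsetCheck` and `utf8Bytes` are assumptions for illustration) contrasts that with an explicit `StandardCharsets.UTF_8`, which gives the same bytes on every platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class DefaultCharsetCheck {

    // Platform-independent: always encodes with UTF-8.
    static byte[] utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String helloEuro = "hello \u20AC";

        // Depends on the JVM and the operating system settings.
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("default length  = " + helloEuro.getBytes().length);

        // Always 9: six one-byte chars plus three bytes for the euro sign.
        System.out.println("utf-8 length    = " + utf8Bytes(helloEuro).length);
    }
}
```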

The second one is checking if you are reading or writing files using FileReader or FileWriter; in both cases the documentation states:

The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable

As with String.getBytes(), reading/writing the same file on different platforms with a reader/writer and without specifying the charset may lead to different byte sequences due to the different default platform encodings.

The solution, as the javadoc suggests, is to use InputStreamReader/OutputStreamWriter, which wrap an InputStream/OutputStream together with a charset specification.
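A minimal sketch of that approach follows, assuming a throwaway temporary file and made-up names (`ExplicitCharsetIo`, `writeAndReadBack`): the same charset is passed explicitly to both the writer and the reader.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the original answer.
class ExplicitCharsetIo {

    // Writes the text to a temporary file and reads it back, using the given charset both times.
    static String writeAndReadBack(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("encoding-demo", ".txt");
            try {
                try (Writer writer = new OutputStreamWriter(new FileOutputStream(file.toFile()), charset)) {
                    writer.write(text);
                }
                StringBuilder sb = new StringBuilder();
                try (Reader reader = new InputStreamReader(new FileInputStream(file.toFile()), charset)) {
                    char[] buffer = new char[1024];
                    int n;
                    while ((n = reader.read(buffer)) != -1) {
                        sb.append(buffer, 0, n);
                    }
                }
                return sb.toString();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String helloEuro = "hello \u20AC";
        // Same result on every platform because nothing relies on the default charset.
        System.out.println(writeAndReadBack(helloEuro, StandardCharsets.UTF_8).equals(helloEuro));
    }
}
```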

Some final notes on how some xml libraries read XML content:

  1. if you pass a Reader, the library relies on the reader for the encoding (i.e. it does not check what the xml header says) and does not do anything about encoding, since it is reading chars, not bytes.

  2. if you pass an InputStream or a File, the library relies on the xml header for the encoding and it may throw some encoding Exceptions
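The two notes above can be checked with the JDK's own DOM parser. This sketch (the class `ReaderVersusStream` and its methods are illustrative names) builds the mislabeled document from the earlier example in memory, then parses it once from the raw bytes and once through a Reader that uses the real encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
class ReaderVersusStream {

    static final Charset LATIN9 = Charset.forName("ISO-8859-15");

    // Header says UTF-8, but the bytes are produced with ISO-8859-15.
    static byte[] mislabeledBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<specialchars>\u00E0\u00E8\u20AC</specialchars>";
        return xml.getBytes(LATIN9);
    }

    // The parser decodes the bytes itself, trusting the header. Returns null if parsing fails.
    static String parseFromStream(byte[] bytes) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            return null;
        }
    }

    // The Reader decodes the bytes; the parser only sees chars. Returns null if parsing fails.
    static String parseFromReader(byte[] bytes, Charset actualCharset) {
        try {
            Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), actualCharset);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(reader))
                    .getDocumentElement().getTextContent();
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = mislabeledBytes();
        System.out.println("from stream: " + (parseFromStream(bytes) == null ? "failed" : "ok"));
        System.out.println("from reader: " + (parseFromReader(bytes, LATIN9) == null ? "failed" : "ok"));
    }
}
```

Passing a Reader works here only because the caller happens to know the real encoding; it moves the responsibility from the xml header to your code.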

Database

A different issue may arise when dealing with databases. When a database is created it has an encoding property that is used to store varchar and other character columns (such as clob). If the database is created with an 8-bit encoding (ISO8859-15 for example), problems may arise when you try to insert chars not allowed by that encoding. What is saved in the db may be different from the string specified at Java level, because in Java strings are represented in memory in UTF-16, which is "wider" than the encoding specified at the database level. The simplest solution is: create your database with a UTF-8 encoding.
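If you cannot change the database encoding, one possible safeguard is to check on the Java side whether a string fits the database charset before inserting it. The sketch below is only an approximation under stated assumptions: the class and method names are made up, the charset passed in is whatever you know your database to use, and the real conversion is still done by the driver and the database.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class ColumnCharsetCheck {

    // True when every character of the text can be represented in the given charset.
    static boolean fitsIn(String text, Charset databaseCharset) {
        return databaseCharset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        Charset latin9 = Charset.forName("ISO-8859-15"); // assumed database encoding

        System.out.println(fitsIn("hello \u20AC", latin9));                      // true: the euro sign exists in ISO-8859-15
        System.out.println(fitsIn("hello \u20AC", StandardCharsets.ISO_8859_1)); // false: no euro sign in ISO-8859-1
        System.out.println(fitsIn("\u4E2D", latin9));                            // false: a CJK character
    }
}
```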

Web

This is a very good starting point.

If you feel something is missing feel free to ask for something more in the comments.
