Producing valid XML with Java and UTF-8 encoding

Question

I am using JAXP to generate and parse an XML document in which some fields are loaded from a database.

Code to serialize the XML:

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("test");
root.setAttribute("version", text);
doc.appendChild(root);

DOMSource domSource = new DOMSource(doc);
TransformerFactory tFactory = TransformerFactory.newInstance();

FileWriter out = new FileWriter("test.xml");
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(domSource, new StreamResult(out)); 

Code to parse the XML:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse("test.xml");

I get the following exception:

[Fatal Error] test.xml:1:4: Invalid byte 1 of 1-byte UTF-8 sequence.
Exception in thread "main" org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    at com.test.Test.xml(Test.java:27)
    at com.test.Test.main(Test.java:55)

The String text includes u-umlaut and o-umlaut (character codes 0xFC and 0xF6). These are the characters that cause the error. When I escape the String myself to use &#xFC; and &#xF6;, the problem goes away. Other entities are automatically encoded when I write out the XML.

How do I get my output to be written / read properly without substituting these characters myself?

(I've read the following questions already:

How to encode characters from Oracle to XML?

Repairing wrong encoding in XML files)

Recommended answer

Use a FileOutputStream rather than a FileWriter.

The latter applies the platform default encoding, which is quite likely not UTF-8 (depending on your platform, it's probably Windows-1252 or ISO-8859-1).
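As a minimal sketch of that change, assuming the same one-element document as in the question: the transformer is handed a byte stream, so it performs the UTF-8 encoding itself. The class name Utf8XmlWriter and the helper write are made up for illustration.

```java
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class name; the document structure mirrors the question.
public class Utf8XmlWriter {

    // Serializes <test version="..."/> to the given byte stream as UTF-8.
    static void write(String text, OutputStream out) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("test");
        root.setAttribute("version", text);
        doc.appendChild(root);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        // The StreamResult wraps bytes, not characters, so the transformer
        // (not a Writer with some other charset) decides how to encode.
        transformer.transform(new DOMSource(doc), new StreamResult(out));
    }

    public static void main(String[] args) throws Exception {
        // u-umlaut and o-umlaut, written as escapes so the source file's
        // own encoding does not matter.
        try (OutputStream out = new FileOutputStream("test.xml")) {
            write("\u00FC\u00F6", out);
        }
    }
}
```

If a Writer is required for some other reason, wrapping the stream in an OutputStreamWriter with an explicit UTF-8 charset should have the same effect, as long as the charset matches the ENCODING output property.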

Edit (now that I have some time):

An XML document without a prologue is permitted to be encoded as UTF-8 or UTF-16. With a prologue, it is allowed to specify its encoding (the prologue can contain only US-ASCII characters, so the prologue is always readable).

A Reader deals with characters; it will decode the byte stream of the underlying InputStream. As a result, when you pass a Reader to the parser, you are telling it that you've already handled the encoding, so the parser will ignore the prologue. When you pass an InputStream (which reads bytes), it does not make this assumption, and will look to the prologue to define the encoding -- or default to UTF-8/UTF-16 if it's not there.
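The difference can be sketched with an in-memory document that declares ISO-8859-1 and stores u-umlaut as the single byte 0xFC. This is an illustration under assumptions, not code from the question: the class name ReaderVersusStream and its helpers are hypothetical, and the Reader is deliberately given the wrong charset to show that the parser trusts it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
public class ReaderVersusStream {

    // A document whose prologue declares ISO-8859-1; u-umlaut is the byte 0xFC.
    static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<test version=\"\u00FC\"/>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Byte input: the parser reads the prologue and picks the decoder itself.
    static String versionViaStream(byte[] bytes) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getAttribute("version");
    }

    // Character input: the caller has already chosen the charset, so the
    // encoding named in the prologue is not consulted.
    static String versionViaReader(byte[] bytes, Charset charset) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset);
        Document doc = builder.parse(new InputSource(reader));
        return doc.getDocumentElement().getAttribute("version");
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = sampleBytes();
        // Booleans are printed so the console encoding cannot blur the result.
        System.out.println("\u00FC".equals(versionViaStream(bytes)));
        System.out.println("\u00FC".equals(versionViaReader(bytes, StandardCharsets.UTF_8)));
    }
}
```

With the JDK's bundled parser this should print true and then false: the stream version honors the declared encoding, while the mis-decoded Reader has already turned the byte 0xFC into a replacement character before the parser sees it.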

I've never tried reading a file that is encoded in UTF-16. I suspect that the parser will look for a Byte Order Mark (BOM) as the first 2 bytes of the file.
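One way to check that suspicion is a small experiment, sketched here under assumptions: Java's UTF-16 charset writes a big-endian BOM (0xFE 0xFF) when encoding, and the document below has no prologue at all, so the BOM is the only hint the parser gets. The class name Utf16BomCheck and its helpers are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical class name, used only for this experiment.
public class Utf16BomCheck {

    // Encodes a prologue-free document as UTF-16; the encoder prepends a BOM.
    static byte[] utf16Bytes() {
        return "<test version=\"\u00FC\"/>".getBytes(StandardCharsets.UTF_16);
    }

    // Parses from bytes, leaving encoding detection entirely to the parser.
    static String parseVersion(byte[] bytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getAttribute("version");
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = utf16Bytes();
        // Show the first two bytes, which should be the byte order mark.
        System.out.printf("%02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF);
        System.out.println("\u00FC".equals(parseVersion(bytes)));
    }
}
```

If the parser does detect the BOM, the attribute comes back intact even though nothing in the document names its encoding.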
