Producing valid XML with Java and UTF-8 encoding


Question

I am using JAXP to generate and parse an XML document in which some fields are loaded from a database.

Code to serialize the XML:

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("test");
root.setAttribute("version", text);
doc.appendChild(root);

DOMSource domSource = new DOMSource(doc);
TransformerFactory tFactory = TransformerFactory.newInstance();

FileWriter out = new FileWriter("test.xml");
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(domSource, new StreamResult(out)); 

Code to parse the XML:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse("test.xml");

I get the following exception:

[Fatal Error] test.xml:1:4: Invalid byte 1 of 1-byte UTF-8 sequence.
Exception in thread "main" org.xml.sax.SAXParseException: Invalid byte 1 of 1-byte UTF-8 sequence.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(Unknown Source)
    at com.test.Test.xml(Test.java:27)
    at com.test.Test.main(Test.java:55)

The String text includes u-umlaut and o-umlaut (character codes 0xFC and 0xF6). These are the characters that cause the error. When I escape the String myself so that ü and ö are written as character references (for example &#252; and &#246;), the problem goes away. Other entities are encoded automatically when I write out the XML.

How do I get my output written and read correctly without substituting these characters myself?

(I've already read the following questions:

How to encode characters from Oracle to XML?

Repairing wrong encoding in XML files)

Recommended answer

Use a FileOutputStream rather than a FileWriter.

A FileWriter applies its own encoding (the platform default), which is almost certainly not UTF-8; depending on your platform, it is probably Windows-1252 or ISO-8859-1.
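As a rough sketch of that change, the serialization code from the question could hand the Transformer a byte stream, so that the serializer itself turns characters into UTF-8 bytes. The class name Utf8XmlWriter, the write helper, and the temp file are made up for this illustration; only the JAXP calls come from the question.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of the original question or answer.
public class Utf8XmlWriter {

    /** Builds <test version="..."/> and serializes it to the byte stream as UTF-8. */
    public static void write(String text, OutputStream out) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("test");
            root.setAttribute("version", text);
            doc.appendChild(root);

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            // Given a byte stream, the serializer converts characters to bytes itself,
            // using the encoding requested above.
            transformer.transform(new DOMSource(doc), new StreamResult(out));
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize XML", e);
        }
    }

    public static void main(String[] args) throws Exception {
        // A temp file stands in for the question's test.xml.
        File file = File.createTempFile("test", ".xml");
        try (OutputStream out = new FileOutputStream(file)) {
            write("f\u00fcr \u00f6l", out); // contains u-umlaut and o-umlaut
        }
        System.out.println("Wrote " + file.getAbsolutePath());
    }
}
```

If a Writer is really needed, an OutputStreamWriter constructed with an explicit UTF-8 charset should also keep the bytes consistent with the declared encoding.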

Edit (now that I have some time):

An XML document without a prologue must be encoded as UTF-8 or UTF-16. A document with a prologue may declare its encoding there (the prologue can contain only US-ASCII characters, so it is always readable).
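A small experiment can make this rule concrete. The sketch below feeds the parser the same ISO-8859-1 bytes with and without a prologue; the class DeclaredEncodingDemo and its parses method are invented for the illustration, and the sample document simply mirrors the one in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original answer.
public class DeclaredEncodingDemo {

    public static final String DECLARATION = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>";
    public static final String BODY = "<test version=\"f\u00fcr \u00f6l\"/>";

    /** Returns true if the parser accepts the raw bytes as a well-formed document. */
    public static boolean parses(byte[] xmlBytes) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new ByteArrayInputStream(xmlBytes));
            return true;
        } catch (SAXException | IOException e) {
            return false; // e.g. "Invalid byte 1 of 1-byte UTF-8 sequence"
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a parser", e);
        }
    }

    public static void main(String[] args) {
        // ISO-8859-1 bytes without a prologue: the parser assumes UTF-8.
        System.out.println(parses(BODY.getBytes(StandardCharsets.ISO_8859_1)));
        // The same bytes preceded by a prologue that names the encoding.
        System.out.println(parses((DECLARATION + BODY).getBytes(StandardCharsets.ISO_8859_1)));
        // UTF-8 bytes without a prologue.
        System.out.println(parses(BODY.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The first case reproduces the situation in the question: the file claims (or defaults to) UTF-8, but the umlauts were written as single ISO-8859-1 bytes.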

A Reader deals with characters; it decodes the byte stream of the underlying InputStream. As a result, when you pass a Reader to the parser, you are telling it that you have already handled the encoding, so the parser will ignore the prologue. When you pass an InputStream (which reads bytes), it does not make this assumption and will look to the prologue to determine the encoding, or default to UTF-8/UTF-16 if there is none.
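The difference can be seen by parsing the same bytes both ways. This is only a sketch: ReaderVersusStreamDemo and its methods are hypothetical names, and the sample document is declared and encoded as ISO-8859-1 so that a wrong charset on the Reader side becomes visible.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original answer.
public class ReaderVersusStreamDemo {

    /** A document declared and encoded as ISO-8859-1, so u-umlaut is the single byte 0xFC. */
    public static byte[] sampleBytes() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><test version=\"\u00fc\"/>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Byte input: the parser reads the prologue and decodes the bytes itself. */
    public static String versionFromStream(byte[] xmlBytes) {
        return version(new InputSource(new ByteArrayInputStream(xmlBytes)));
    }

    /** Character input: the caller's charset decides how the bytes are decoded. */
    public static String versionFromReader(byte[] xmlBytes, Charset charset) {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), charset);
        return version(new InputSource(reader));
    }

    private static String version(InputSource source) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(source);
            return doc.getDocumentElement().getAttribute("version");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = sampleBytes();
        System.out.println("\u00fc".equals(versionFromStream(bytes)));
        System.out.println("\u00fc".equals(versionFromReader(bytes, StandardCharsets.ISO_8859_1)));
        // Decoding with the wrong charset damages the character before the parser sees it.
        System.out.println("\u00fc".equals(versionFromReader(bytes, StandardCharsets.UTF_8)));
    }
}
```

Passing bytes and letting the parser consult the prologue is usually the simpler choice, since only one place then decides the encoding.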

I've never tried reading a file that is encoded in UTF-16. I suspect that the parser will look for a byte order mark (BOM) in the first 2 bytes of the file.
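One way to check that suspicion is to encode a document without a prologue as UTF-16 and parse the raw bytes. The sketch below relies on Java's "UTF-16" charset, which writes a big-endian BOM (0xFE 0xFF) when encoding; the class Utf16BomDemo and its methods are invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, not part of the original answer.
public class Utf16BomDemo {

    /** Encodes with Java's "UTF-16" charset, which emits a big-endian BOM first. */
    public static byte[] encode(String xml) {
        return xml.getBytes(StandardCharsets.UTF_16);
    }

    /** Parses raw bytes and returns the root element's "version" attribute. */
    public static String version(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getAttribute("version");
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // No prologue: the parser has only the BOM to go on.
        byte[] bytes = encode("<test version=\"f\u00fcr \u00f6l\"/>");
        System.out.println(String.format("%02X %02X", bytes[0] & 0xFF, bytes[1] & 0xFF));
        System.out.println(version(bytes));
    }
}
```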
