Problems parsing XML in Java
Question
I'm having trouble parsing an XML document. For some reason, there are text nodes where I would not expect them, and so my test fails. The XML file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<RootNode>
    <PR1>PR1</PR1>
    <ROL>one</ROL>
    <ROL>two</ROL>
    <DG1>DG1</DG1>
    <ROL>three</ROL>
    <ZBK>ZBK</ZBK>
    <ROL>four</ROL>
</RootNode>
Here is a snippet of code that reproduces the error:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(TestHL7Helper.class.getResourceAsStream("TestHL7HelperInput.xml"));
Node root = doc.getFirstChild();
Node pr1 = root.getFirstChild();
Inspecting the root variable yields [RootNode: null], which seems right, but then it all goes wrong. The pr1 variable turns out to be a text node, [#text:\n ]. Why does the parser treat the newline and the spaces as a text node? Shouldn't they be ignored? I tried changing the encoding, but that did not help either. Any ideas?
If I remove all newlines and spaces and put the XML document on a single line, it all works fine...
Recommended answer
XML supports mixed content, meaning an element can have both text and element child nodes. This supports use cases like the following:
<text>I've bolded the <b>important</b> part.</text>
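To make the mixed-content case concrete, here is a minimal sketch that parses the fragment above with a default DOM parser and lists the root element's child nodes. The class name MixedContentDemo and the helper childNames are hypothetical names chosen for illustration; the XML is read from an in-memory string rather than a file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MixedContentDemo {

    // Hypothetical helper: returns the node names of the root element's direct children.
    static List<String> childNames(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document d = db.parse(new InputSource(new StringReader(xml)));
        NodeList children = d.getDocumentElement().getChildNodes();
        List<String> names = new ArrayList<>();
        for (int i = 0; i < children.getLength(); i++) {
            names.add(children.item(i).getNodeName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Text before <b>, the <b> element itself, and text after it.
        System.out.println(childNames("<text>I've bolded the <b>important</b> part.</text>"));
        // prints [#text, b, #text]
    }
}
```

Here both text nodes carry real content, which is why a parser cannot simply discard text that sits between elements.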
input.xml
This means that, by default, a DOM parser treats the whitespace-only text nodes in the following document as significant (below is a simplified version of your XML document):
<RootNode>
    <PR1>PR1</PR1>
</RootNode>
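The default behaviour can be checked with a small sketch like the one below. It assumes the simplified document is held in a string with a newline and indentation around PR1; the class name WhitespaceNodesDemo and its helper methods are hypothetical and only serve the illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class WhitespaceNodesDemo {

    // Assumed layout of the simplified input: PR1 on its own indented line.
    static final String INPUT = "<RootNode>\n  <PR1>PR1</PR1>\n</RootNode>";

    // Parses the XML with a default (schema-less) factory and returns the root's children.
    static NodeList rootChildren(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getChildNodes();
    }

    static int childCount(String xml) throws Exception {
        return rootChildren(xml).getLength();
    }

    // Counts the children that are text nodes containing nothing but whitespace.
    static int whitespaceOnlyTextCount(String xml) throws Exception {
        NodeList children = rootChildren(xml);
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.TEXT_NODE && n.getNodeValue().trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childCount(INPUT));              // prints 3
        System.out.println(whitespaceOnlyTextCount(INPUT)); // prints 2
    }
}
```

Without a schema or DTD the parser has no way to tell this whitespace apart from meaningful text, so it keeps both whitespace-only nodes. The same document written on a single line has only one child, which matches what you observed.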
Demo Code
If you have an XML schema, you can set the ignoringElementContentWhitespace property on the DocumentBuilderFactory. With the schema available, the DOM parser knows if and when whitespace is significant.
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.parsers.*;
import javax.xml.validation.*;
import org.w3c.dom.Document;

public class Demo {

    public static void main(String[] args) throws Exception {
        // Load the XML schema so the parser knows each element's content model.
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema s = sf.newSchema(new File("src/forum16231687/schema.xsd"));

        // Attach the schema and ask the parser to drop whitespace in element-only content.
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setSchema(s);
        dbf.setIgnoringElementContentWhitespace(true);

        // Parse the document and print how many child nodes the root element has.
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document d = db.parse(new File("src/forum16231687/input.xml"));
        System.out.println(d.getDocumentElement().getChildNodes().getLength());
    }

}
schema.xsd
If you create a schema.xsd that looks like the following, the demo code will report that the root element has 1 child node.
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema">
    <element name="RootNode">
        <complexType>
            <sequence>
                <element name="PR1" type="string"/>
            </sequence>
        </complexType>
    </element>
</schema>
If you change schema.xsd so that RootNode has mixed content, the demo code will report that RootNode has 3 child nodes.
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema">
    <element name="RootNode">
        <complexType mixed="true">
            <sequence>
                <element name="PR1" type="string"/>
            </sequence>
        </complexType>
    </element>
</schema>
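To try both schema variants without creating files, the demo can be adapted roughly as follows. This is a sketch under a few assumptions: the schema and the input document are built as in-memory strings, the input has a newline and indentation around PR1, and the class name SchemaWhitespaceDemo and the helpers schema and countRootChildren are hypothetical names. The factory calls are the same ones used in the demo code above.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SchemaWhitespaceDemo {

    // Assumed layout of input.xml: PR1 on its own indented line.
    static final String INPUT = "<RootNode>\n  <PR1>PR1</PR1>\n</RootNode>";

    // Hypothetical helper: builds schema.xsd as a string, optionally with mixed="true".
    static String schema(boolean mixed) {
        return "<schema xmlns=\"http://www.w3.org/2001/XMLSchema\">"
             + "<element name=\"RootNode\">"
             + "<complexType" + (mixed ? " mixed=\"true\"" : "") + ">"
             + "<sequence><element name=\"PR1\" type=\"string\"/></sequence>"
             + "</complexType></element></schema>";
    }

    // Parses the XML against the given schema with whitespace ignoring enabled
    // and returns the number of child nodes of the root element.
    static int countRootChildren(String xsd, String xml) throws Exception {
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema s = sf.newSchema(new StreamSource(new StringReader(xsd)));
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setSchema(s);
        dbf.setIgnoringElementContentWhitespace(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document d = db.parse(new InputSource(new StringReader(xml)));
        return d.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        // Element-only content: the whitespace should be dropped, leaving PR1 alone.
        System.out.println(countRootChildren(schema(false), INPUT));
        // Mixed content: the whitespace counts as text and should be kept.
        System.out.println(countRootChildren(schema(true), INPUT));
    }
}
```

With the parser bundled in the JDK, the two lines printed should be 1 and 3, matching the counts described above.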