JAXP parse error on valid XML

Question

I am trying to run some XPath queries on XML in Java, and apparently the recommended way to do so is to construct a document first.

Here is the standard JAXP code sample that I was using:

import org.w3c.dom.Document;
import javax.xml.parsers.*;

final DocumentBuilder xmlParser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
final Document doc = xmlParser.parse(xmlFile);
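
For context, the full "construct a document, then query it" flow with the JDK's own JAXP classes might look like the sketch below. The class name XPathOverDom, the helper evaluate, and the sample XML are hypothetical and only serve as illustration; checked exceptions are wrapped to keep the sketch short.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original question.
class XPathOverDom {

    /** Parses the XML text into a DOM document and returns the XPath result as a string. */
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = parser.parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The two-argument evaluate returns the string value of the first match.
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            // Wrapped only to keep the sketch short; real code should handle each case.
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample data, loosely modelled on the "item" elements mentioned in the question.
        String xml = "<blog><item><title>First post</title></item></blog>";
        System.out.println(evaluate(xml, "/blog/item/title"));
    }
}
```

The parsing step is the part that fails in the question; the XPath step never gets a chance to run.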

I also tried the Saxon API, but got the same errors:

import java.io.File;
import net.sf.saxon.s9api.*;

final DocumentBuilder documentBuilder = new Processor(false).newDocumentBuilder();
final XdmNode xdm = documentBuilder.build(new File("out/data/blog.xml"));

Here is a minimal reconstructed example XML which the DocumentBuilder in JDK 1.8 can't parse:

<?xml version="1.1" encoding="UTF-8" ?>
<xml>
    <![CDATA[Some example text with [funny highlight]]]>
</xml>

According to the spec, the square bracket ] just before the end-of-CDATA marker ]]> is perfectly legal, but the parser just exits with a stack trace and the message org.xml.sax.SAXParseException; XML document structures must start and end within the same entity.

On my original data file, which contains a lot of CDATA sections, the message is instead org.xml.sax.SAXParseException; The element type "item" must be terminated by the matching end-tag "</item>". In both cases, "com.sun.org.apache.xerces" shows up in the stack trace a lot.

From both observations, it seems as if the parser simply does not end the CDATA section at ]]>.

As it turned out, the example parses when the <?xml ... ?> declaration is omitted. I had not checked that before posting here and have only just added it.
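
To check this on a given setup, a small reproducer along the following lines could parse the example once with and once without the declaration. The class name CdataRepro and the helper tryParse are hypothetical. The outcome of the first call depends on which parser is on the classpath and on the JDK version, so no result is asserted for it here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical reproducer, not part of the original question.
class CdataRepro {

    // The declaration and body from the minimal example above.
    static final String DECLARATION = "<?xml version=\"1.1\" encoding=\"UTF-8\" ?>\n";
    static final String BODY =
            "<xml>\n"
            + "    <![CDATA[Some example text with [funny highlight]]]>\n"
            + "</xml>\n";

    /** Tries to parse the XML text and reports either the root text or the parser's error message. */
    static String tryParse(String xml) {
        try {
            DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = parser.parse(new InputSource(new StringReader(xml)));
            return "parsed: " + doc.getDocumentElement().getTextContent().trim();
        } catch (Exception e) {
            return "failed: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // With the version="1.1" declaration: the case that fails in the question.
        System.out.println(tryParse(DECLARATION + BODY));
        // Without the declaration the document is treated as XML 1.0.
        System.out.println(tryParse(BODY));
    }
}
```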

Recommended answer

Short answer: add Apache Xerces to the build path. It will automatically be loaded instead of the parser bundled with the JDK, and the XML will be parsed just fine. Copy-paste Gradle dependency:

implementation "xerces:xercesImpl:2.11.0"

Some background: Apache Xerces is indeed the same parser that the JDK uses internally (repackaged under com.sun.org.apache.xerces.internal, which is why that package shows up in the stack trace). But even though Xerces 2.11 dates from 2013, the JDK comes with a much older version. That really sucks!

As the Saxon team puts it:

Saxonica recommends use of the Xerces parser from Apache in preference to the version bundled in the JDK, which is known to have some serious bugs.

In case you wonder how simply putting Xerces on the classpath makes the problem disappear: even though the JDK and Saxon DocumentBuilders construct entirely different document types, they both use the same standard Java interfaces to call the parser, and the same mechanism to find and load the parser (or rather, the parser factory). In short, a java.util.ServiceLoader looks through all the JARs on the classpath for provider-configuration files in META-INF/services, and this is how the Xerces JAR announces that it provides an XML parser. Fortunately for us, anything found there takes precedence over the JDK's own implementation.
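
To see which implementation actually wins on a given classpath, a quick check like the sketch below may help. The class name ParserLookup and its two helpers are hypothetical. With only the JDK present, the active factory is typically com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl; with xercesImpl on the classpath it should be org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical diagnostic class, not part of the original answer.
class ParserLookup {

    /** Returns the class name of the factory that JAXP selects right now. */
    static String activeFactory() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    /** Lists the factories that the ServiceLoader can discover as registered providers. */
    static List<String> registeredProviders() {
        List<String> names = new ArrayList<>();
        for (DocumentBuilderFactory factory : ServiceLoader.load(DocumentBuilderFactory.class)) {
            names.add(factory.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("Active factory:       " + activeFactory());
        // May be empty when no third-party parser JAR is on the classpath.
        System.out.println("Registered providers: " + registeredProviders());
    }
}
```

Running this before and after adding the Xerces dependency is a cheap way to confirm that the replacement parser is really being picked up.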

After this bad experience with the JDK XML classes, I am even more motivated to refactor projects to use Saxon for XPath processing instead of the XPath implementation in the JDK. The other reason is the technical advantage of XDM over DOM (same link as above).
