Java XML parser adding unnecessary xmlns and xml:space attributes

Problem description

I'm using Java 11 (AdoptOpenJDK 11.0.5 2019-10-15) on Windows 10. I'm parsing some legacy XHTML 1.1 files, which take the following general form:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/MarkUp/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>XHTML 1.1 Skeleton</title>
</head>
<body>
</body>
</html>

I'm using a simple non-validating parser:

DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
final Document document;
try (InputStream inputStream = new BufferedInputStream(getClass().getResourceAsStream("xhtml-1.1-test.xhtml"))) {
  document = documentBuilder.parse(inputStream);
}

For some reason it's adding extra attributes such as xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" and xml:space="preserve" all over the place:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" version="-//W3C//DTD XHTML 1.1//EN" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:lang="en">
<head xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <title xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">XHTML 1.1 Skeleton</title>
</head>
<body xmlns="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xml:space="preserve"></body>
</html>

I know that DTDs can provide default attribute values, but I don't understand why the xmlns:xsi attribute was added, when there appear to be no elements or attributes in that namespace.

Furthermore xml:space="preserve" seems to be incorrect altogether; only elements like <pre> should have xml:space="preserve" set, I would think. (Update: The HTML5 specification indicates that HTML by default preserves space, and that xml:space must not be serialized in HTML, so maybe that was part of the reasoning here. I will improve my HTML serializer to ignore the xml:space attribute, which will partially mitigate this issue.)

Also note the version="-//W3C//DTD XHTML 1.1//EN" attribute; that's something I don't need or want.

Am I doing something wrong?

Interestingly this is not a problem with XHTML 1.0 strict.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>XHTML 1.0 Skeleton</title>
</head>
<body>
</body>
</html>

After parsing, this yields:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>XHTML 1.0 Skeleton</title>
</head>
<body>
</body>
</html>

But it is a problem with -//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN. So this seems to be just an XHTML 1.1 problem.

Update: I have some potentially helpful news: if I create a new document without a DTD and import the entire document tree into the new document, all this cruft (which apparently comes from implied attributes in the DTD) goes away, because the destination document doesn't have a DTD at all. (See "How to force removal of attributes with implied default values from DTD in Java XML DOM".) But this is very inefficient; it would be nice to turn this off altogether when parsing.
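The import workaround described in the update could look roughly like the sketch below. The class name `DtdFreeCopy` and its two methods are hypothetical names chosen for illustration, and the sketch assumes the JDK's built-in DOM implementation, where `Document.importNode` copies only specified attributes and leaves DTD-defaulted ones behind.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: copies a parsed tree into a document that has no DTD. */
public class DtdFreeCopy {

    /** Parses XML text with a namespace-aware, non-validating DOM parser. */
    public static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Returns a new DTD-less document whose root is a deep import of the source root. */
    public static Document copyWithoutDtd(Document source) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document target = factory.newDocumentBuilder().newDocument();
            // importNode copies specified attributes; attributes that only exist
            // because of a DTD default are not carried over, and the target
            // document has no DTD that could add them back.
            Node importedRoot = target.importNode(source.getDocumentElement(), true);
            target.appendChild(importedRoot);
            return target;
        } catch (Exception e) {
            throw new IllegalStateException("Could not copy document", e);
        }
    }
}
```

As the update notes, this copies the whole tree a second time, so it is a cleanup step rather than a way to stop the parser from adding the attributes in the first place.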

Recommended answer

Have you tried the nonvalidating/load-dtd-grammar Xerces configuration feature?
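A minimal sketch of how that feature might be set on a `DocumentBuilderFactory` follows. The feature URIs are the documented Xerces ones; the class name `NoDtdGrammarParser` is hypothetical, and the sketch additionally turns off `load-external-dtd`, which is an assumption on my part and not something the answer asks for. Whether this removes every unwanted attribute in the XHTML 1.1 case above is not verified here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical helper: DOM parsing with the Xerces DTD-related features switched off. */
public class NoDtdGrammarParser {

    static final String LOAD_DTD_GRAMMAR =
            "http://apache.org/xml/features/nonvalidating/load-dtd-grammar";
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Parses XML text without using the DTD grammar for defaults and without fetching the external DTD. */
    public static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Xerces-specific features; a non-Xerces parser may reject them
            // with a ParserConfigurationException.
            factory.setFeature(LOAD_DTD_GRAMMAR, false);
            // Assumption beyond the answer: also skip the external DTD subset.
            factory.setFeature(LOAD_EXTERNAL_DTD, false);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```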

However, I've just been looking at how I do this in Saxon, and I don't ask the XML parser to suppress defaulted attributes; rather, I discard them when they are reported. I'm using Xerces as a SAX parser, not a DOM parser, though. (In SAX, defaulted attributes can be recognized through the Attributes2 interface, whose isSpecified() method returns false for them.)
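The discard-on-report approach could be sketched as below. This is not Saxon's actual code; the class `SpecifiedAttributeCollector` is a hypothetical illustration. It relies on the SAX2 extension interface `org.xml.sax.ext.Attributes2`, whose `isSpecified` method returns false for attributes that were defaulted from the DTD, and it keeps every attribute if the parser does not supply that interface.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.Attributes2;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records only attributes that were written in the document itself. */
public class SpecifiedAttributeCollector extends DefaultHandler {

    private final List<String> specified = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // Attributes2 is an optional extension; without it nothing can be filtered.
        Attributes2 extended = attributes instanceof Attributes2 ? (Attributes2) attributes : null;
        for (int i = 0; i < attributes.getLength(); i++) {
            // isSpecified(i) is false when the value came from a DTD default.
            if (extended == null || extended.isSpecified(i)) {
                specified.add(qName + "/@" + attributes.getQName(i) + "=" + attributes.getValue(i));
            }
        }
    }

    /** Parses XML text and returns "element/@attribute=value" for each specified attribute. */
    public static List<String> collect(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            SpecifiedAttributeCollector handler = new SpecifiedAttributeCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.specified;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

A real pipeline would forward the surviving attributes to the next handler or tree builder instead of collecting strings; the list here only makes the filtering visible.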
