How to ignore whitespace while reading a file to produce an XML DOM
Question
I'm trying to read a file to produce a DOM Document. The file contains whitespace and newlines between elements, and I haven't been able to get the parser to ignore them:
DocumentBuilderFactory docfactory=DocumentBuilderFactory.newInstance();
docfactory.setIgnoringElementContentWhitespace(true);
The Javadoc says that the setIgnoringElementContentWhitespace method only takes effect when the validating flag is enabled, but I don't have a DTD or XML Schema for the document.
What should I do?
Update
I don't like the idea of adding <!ELEMENT ...> declarations myself, and I have tried the solution proposed in the forum thread Tomalak pointed to, but it doesn't work. I'm using Java 1.6 in a Linux environment. If nothing else is proposed, I'll write a few methods to skip whitespace-only text nodes.
Recommended answer
'IgnoringElementContentWhitespace' is not about removing all pure-whitespace text nodes. It only removes whitespace nodes whose parents are described in the schema as having ELEMENT content, that is to say, elements that contain only other elements and never text.
If you don't have a schema (DTD or XSD) in use, element content defaults to MIXED, so this parameter will never have any effect. (Unless the parser provides a non-standard DOM extension to treat all unknown elements as containing ELEMENT content, which as far as I know the ones available for Java do not.)
You could hack the document on the way into the parser to include the schema information, for example by adding an internal subset to the <!DOCTYPE ... [...]> declaration containing <!ELEMENT ...> declarations, then use the IgnoringElementContentWhitespace parameter.
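As a rough sketch of the internal-subset approach, the snippet below prepends a DOCTYPE with <!ELEMENT ...> declarations to the XML text before parsing. The class name, the helper names, and the sample root/item vocabulary are all made up for illustration. It assumes the incoming text has no XML declaration or DOCTYPE of its own, and it turns validation on because the Javadoc ties setIgnoringElementContentWhitespace to validating mode.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DtdWhitespaceDemo {

    /**
     * Hypothetical helper: prepends a DOCTYPE with an internal subset.
     * Assumes xml has no XML declaration and no DOCTYPE of its own.
     */
    static String withInternalSubset(String xml, String rootName, String declarations) {
        return "<!DOCTYPE " + rootName + " [\n" + declarations + "\n]>\n" + xml;
    }

    /** Parses with validation on so the parser knows each element's content model. */
    static Document parseIgnoringWhitespace(String xmlWithDoctype) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        factory.setIgnoringElementContentWhitespace(true);
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xmlWithDoctype)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>\n  <item>text</item>\n  <item>more</item>\n</root>";
        // Illustrative declarations: root holds only elements, item holds text.
        String declarations =
                "<!ELEMENT root (item*)>\n"
              + "<!ELEMENT item (#PCDATA)>";
        Document doc = parseIgnoringWhitespace(withInternalSubset(xml, "root", declarations));
        // Number of child nodes left under the root element after parsing.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

The drawback the asker mentions still applies: every element in the document needs a declaration, or validation will report errors for the undeclared ones.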
Or, possibly easier, you could just strip out the whitespace nodes, either in a post-process or as they come in using an LSParserFilter.
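A minimal sketch of the post-process variant follows: parse normally, then walk the tree and remove every text node that contains nothing but whitespace. The class and method names are illustrative only. Note that it uses String.trim(), which treats every character up to U+0020 as whitespace, slightly more than the four XML whitespace characters, and that it leaves CDATA sections alone.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class WhitespaceStripper {

    /** Recursively removes text nodes that consist only of whitespace. */
    static void stripWhitespace(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before a possible removal
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                stripWhitespace(child);
            }
            child = next;
        }
    }

    /** Plain non-validating parse; no DTD or schema is needed. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<root>\n  <item>text</item>\n  <item/>\n</root>");
        stripWhitespace(doc);
        // Number of child nodes left under the root element after stripping.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Unlike the content-model approach, this removes whitespace-only text everywhere, including inside elements that are meant to hold text, so it is only appropriate when such whitespace is never significant in the document.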