Xerces DOM parser incredibly slow?
Question
Currently, I am trying to clean up an HTML file using JTidy, convert it to XHTML and provide the results to a DOM parser. The following code is the result of these efforts:
public class HeaderBasedNewsProvider implements INewsProvider {
    /* ... */

    public Collection<INewsEntry> getNewsEntries() throws NewsUnavailableException {
        Document document;
        try {
            document = getCleanedDocument();
        } catch (Exception e) {
            throw new NewsUnavailableException(e);
        }
        System.err.println(document.getDocumentElement().getTextContent());
        return null;
    }

    private final Document getCleanedDocument() throws IOException, SAXException, ParserConfigurationException {
        InputStream input = inputStreamProvider.getInputStream();
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        ByteArrayOutputStream tidyOutputStream = new ByteArrayOutputStream();
        tidy.parse(input, tidyOutputStream);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        InputStream domInputStream = new ByteArrayInputStream(tidyOutputStream.toByteArray());
        System.err.println(factory.getClass());
        return factory.newDocumentBuilder().parse(domInputStream);
    }
}
However, the DOM parser implementation (com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl) on my system seems to be incredibly slow. Even for one-line documents such as the following, parsing takes 2-3 minutes:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title></title></head><body><div class="text"><h2>Nachricht vom 16. Juni 2011</h2><h1>Titel</h1><p>Mitteilung <a href="dokumente/medienmitteilungen/MM_NR_jglp.pdf" target="_blank">weiter</a> mehr Mitteilung</p></div></body></html>
Note that - in contrast to the DOM parser - JTidy finishes its work within a second. Therefore, I suspect that I'm somehow misusing the DOM API.
Thanks in advance for any suggestions on this one!
Recommended answer
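Even when not validating, an XML parser still needs to fetch the DTD, for example to support named character entities. If the DTD referenced by the DOCTYPE is loaded over the network, that fetch is a likely cause of the delay. You should look into implementing an EntityResolver that resolves the request for the DTD to a local copy.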
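A minimal sketch of that suggestion follows. The class name LocalDtdParsing and the method parseOffline are hypothetical and not part of the question's code. To stay self-contained, the resolver here answers every DTD request with an empty document instead of a real local copy, which is only acceptable under the assumption that the input contains no named entities (such as &nbsp;) that the real DTD would define. A fuller version would return a bundled copy of the DTD matching the requested public ID.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical helper, not taken from the original question.
class LocalDtdParsing {

    // Resolver that never goes to the network. Assumption: an empty DTD is
    // good enough, i.e. the document uses no entities defined only in the DTD.
    // To serve a real local copy, inspect publicId/systemId here and return an
    // InputSource that reads the bundled DTD file instead.
    static final EntityResolver OFFLINE_RESOLVER =
            (publicId, systemId) -> new InputSource(new StringReader(""));

    // Parses the given bytes with the offline resolver installed.
    static Document parseOffline(byte[] xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Must be set before parse(); otherwise the default resolver is used
        // and the parser may try to download the DTD.
        builder.setEntityResolver(OFFLINE_RESOLVER);
        return builder.parse(new ByteArrayInputStream(xhtml));
    }
}
```

In the question's getCleanedDocument(), the equivalent change would be to keep the DocumentBuilder in a variable, call setEntityResolver on it, and then call parse(domInputStream).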