Xerces DOM parser incredibly slow?

Views: 117 Posted: 2017/6/25 0:34:17 Tags: java performance dom xerces


Question

I am currently trying to clean up an HTML file with JTidy, convert it to XHTML, and pass the result to a DOM parser. The following code is what I have so far:

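import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Collection;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;
import org.xml.sax.SAXException;

public class HeaderBasedNewsProvider implements INewsProvider {

    /* ... */

    public Collection<INewsEntry> getNewsEntries() throws NewsUnavailableException {
        Document document;
        try {
            document = getCleanedDocument();
        } catch (Exception e) {
            throw new NewsUnavailableException(e);
        }
        // Debug output only: dump the text content of the parsed document.
        System.err.println(document.getDocumentElement().getTextContent());
        return null;
    }

    private final Document getCleanedDocument() throws IOException, SAXException, ParserConfigurationException {
        InputStream input = inputStreamProvider.getInputStream();

        // Step 1: let JTidy clean the HTML and write it out as XHTML.
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        ByteArrayOutputStream tidyOutputStream = new ByteArrayOutputStream();
        tidy.parse(input, tidyOutputStream);

        // Step 2: feed the XHTML bytes to the JAXP DOM parser (this is the slow step).
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        InputStream domInputStream = new ByteArrayInputStream(tidyOutputStream.toByteArray());
        System.err.println(factory.getClass());
        return factory.newDocumentBuilder().parse(domInputStream);
    }
}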

However, the DOM parser implementation on my system (com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl) seems to be incredibly slow. Even for a one-line document such as the following, parsing takes 2-3 minutes:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><title></title></head><body><div class="text"><h2>Nachricht vom 16. Juni 2011</h2><h1>Titel</h1><p>Mitteilung <a href="dokumente/medienmitteilungen/MM_NR_jglp.pdf" target="_blank">weiter</a> mehr Mitteilung</p></div></body></html>
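One possible explanation, which the post itself does not confirm: the XHTML that JTidy writes may carry a DOCTYPE with a system identifier pointing at a W3C-hosted DTD, and Xerces loads an external DTD even when setValidating(false) is set. If that is what happens here, the time would be spent waiting on the network rather than parsing. The sketch below shows how that could be ruled out using only JDK classes. The class name OfflineDomParse, the helper parseWithoutDtdFetch, and the sample XHTML string are hypothetical and stand in for JTidy's real output.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class OfflineDomParse {

    // Hypothetical helper: parses XHTML bytes without fetching the external DTD.
    static Document parseWithoutDtdFetch(byte[] xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(false);
        // Xerces-specific feature: turning validation off does not by itself
        // stop the parser from loading the DTD named in the DOCTYPE.
        factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

        DocumentBuilder builder = factory.newDocumentBuilder();
        // Extra safeguard: answer any remaining external entity request
        // with an empty source instead of opening a network connection.
        builder.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));

        return builder.parse(new ByteArrayInputStream(xhtml));
    }

    public static void main(String[] args) throws Exception {
        // Assumed shape of JTidy's XHTML output; the real output may differ.
        String xhtml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title></title></head>"
                + "<body><h1>Titel</h1></body></html>";

        long start = System.nanoTime();
        Document doc = parseWithoutDtdFetch(xhtml.getBytes(StandardCharsets.UTF_8));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println(doc.getDocumentElement().getTextContent()
                + " (" + elapsedMs + " ms)");
    }
}
```

If the parse time drops sharply with these settings, the DTD download is the likely culprit. One caveat: with the DTD skipped, named entities such as &nbsp; are no longer defined, so a document that uses them would fail to parse unless the entities are supplied some other way, for example through a locally cached copy of the DTD.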

Note that JTidy, in contrast to the DOM parser, finishes its work within a second. I therefore suspect that I am somehow misusing the DOM API.

Thanks in advance for any suggestions on this one!
