What are the pros and cons of the leading Java HTML parsers?


Question

Searching SO and Google, I've found that there are a few Java HTML parsers which are consistently recommended by various parties. Unfortunately it's hard to find any information on the strengths and weaknesses of the various libraries. I'm hoping that some people have spent some time comparing these libraries and can share what they've learned.

Here's what I've seen:

And if there's a major parser that I've missed, I'd love to hear about its pros and cons as well.

Thanks!

Answer

General

Almost all known HTML parsers implement the W3C DOM API (part of JAXP, the Java API for XML Processing) and give you an org.w3c.dom.Document back which is ready for direct use by the JAXP APIs. The major differences are usually to be found in the features of the parser in question. Most parsers are to a certain degree forgiving and lenient with non-well-formed HTML ("tag soup"), like JTidy, NekoHTML, TagSoup and HtmlCleaner. You usually use this kind of HTML parser to "tidy" the HTML source (e.g. replacing the HTML-valid <br> by an XML-valid <br />) so that you can traverse it "the usual way" using the W3C DOM and JAXP APIs.

The only ones which jump out are HtmlUnit and Jsoup.

HtmlUnit provides a completely own API which gives you the possibility to act like a web browser programmatically, i.e. enter form values, click elements, invoke JavaScript, etcetera. It's much more than an HTML parser alone. It's a real "GUI-less web browser" and HTML unit-testing tool.
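As a sketch of what that browser-like API looks like (this assumes the HtmlUnit jar is on the classpath; the URL and the form/field names "loginForm", "username" and "submit" are made up for illustration, not taken from any real page):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class HtmlUnitSketch {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Load the page as a browser would (JavaScript runs by default).
            HtmlPage page = webClient.getPage("http://example.com/login");

            // Locate a form, type into a field and click its submit button,
            // all programmatically, as a user would in a real browser.
            HtmlForm form = page.getFormByName("loginForm");
            HtmlTextInput field = form.getInputByName("username");
            field.type("john");
            HtmlSubmitInput button = form.getInputByName("submit");
            HtmlPage result = button.click();

            System.out.println(result.getTitleText());
        }
    }
}
```

This is exactly the kind of interaction a plain HTML parser cannot do, which is why HtmlUnit is the tool of choice for unit testing pages rather than merely extracting data from them.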

Jsoup also provides a completely own API. It gives you the possibility to select elements using jQuery-like CSS selectors and provides a slick API to traverse the HTML DOM tree to get the elements of interest.

Particularly the traversal of the HTML DOM tree is Jsoup's major strength. Anyone who has worked with org.w3c.dom.Document knows what a pain it is to traverse the DOM using the verbose NodeList and Node APIs. True, XPath makes life easier, but still, it's another learning curve and it can end up being verbose anyway.

Here's an example which uses a "plain" W3C DOM parser like JTidy in combination with XPath to extract the first paragraph of your question and the names of all answerers (I am using XPath since without it, the code needed to gather the information of interest would grow ten times as big, short of writing utility/helper methods).

import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

String url = "http://stackoverflow.com/questions/3152138";
Document document = new Tidy().parseDOM(new URL(url).openStream(), null);
XPath xpath = XPathFactory.newInstance().newXPath();

Node question = (Node) xpath.compile("//*[@id='question']//*[contains(@class,'post-text')]//p[1]").evaluate(document, XPathConstants.NODE);
System.out.println("Question: " + question.getFirstChild().getNodeValue());

NodeList answerers = (NodeList) xpath.compile("//*[@id='answers']//*[contains(@class,'user-details')]//a[1]").evaluate(document, XPathConstants.NODESET);
for (int i = 0; i < answerers.getLength(); i++) {
    System.out.println("Answerer: " + answerers.item(i).getFirstChild().getNodeValue());
}

And here's an example of how to do exactly the same with Jsoup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String url = "http://stackoverflow.com/questions/3152138";
Document document = Jsoup.connect(url).get();

Element question = document.select("#question .post-text p").first();
System.out.println("Question: " + question.text());

Elements answerers = document.select("#answers .user-details a");
for (Element answerer : answerers) {
    System.out.println("Answerer: " + answerer.text());
}

Do you see the difference? It's not only less code, but Jsoup is also relatively easy to grasp if you already have moderate experience with CSS selectors (e.g. from developing websites and/or using jQuery).

The pros and cons of each should be clear enough now. If you just want to use the standard JAXP API to traverse the document, then go for the first-mentioned group of parsers. There are quite a lot of them. Which one to choose depends on the features it provides (how is HTML cleaning made easy for you? are there listeners/interceptors and tag-specific cleaners?) and the robustness of the library (how often is it updated/maintained/fixed?). If you want to unit test the HTML, then HtmlUnit is the way to go. If you want to extract specific data from the HTML (which is more often than not the real-world requirement), then Jsoup is the way to go.
