What are the pros and cons of the leading Java HTML parsers?


Problem description



Searching SO and Google, I've found that there are a few Java HTML parsers which are consistently recommended by various parties. Unfortunately, it's hard to find any information on the strengths and weaknesses of the various libraries. I'm hoping that some people have spent some time comparing these libraries and can share what they've learned.

Here's what I've seen:

And if there's a major parser that I've missed, I'd love to hear about its pros and cons as well.

Thanks!

Solution

General

Almost all known HTML parsers implement the W3C DOM API (part of the JAXP API, the Java API for XML Processing) and give you an org.w3c.dom.Document back which is ready for direct use by the JAXP API. The major differences are usually to be found in the features of the parser in question. Most parsers are to a certain degree forgiving and lenient with non-wellformed HTML ("tagsoup"), like JTidy, NekoHTML, TagSoup and HtmlCleaner. You usually use this kind of HTML parser to "tidy" the HTML source (e.g. replacing the HTML-valid <br> with the XML-valid <br />), so that you can traverse it "the usual way" using the W3C DOM and JAXP APIs.
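
As a quick illustration of that tidying step, here's a minimal sketch using JTidy; the tagsoup input string is made up for the example:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.tidy.Tidy;

// Made-up tagsoup input: an unclosed <p> and an HTML-style <br>.
ByteArrayInputStream tagsoup = new ByteArrayInputStream(
        "<p>foo<br>bar".getBytes(StandardCharsets.UTF_8));

Tidy tidy = new Tidy();
tidy.setXHTML(true);             // emit XML-valid output, e.g. <br /> instead of <br>
tidy.parse(tagsoup, System.out); // the cleaned document is written to stdout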

The only ones which jump out are HtmlUnit and Jsoup.

HtmlUnit

HtmlUnit provides a completely custom API which gives you the possibility to act like a web browser programmatically: enter form values, click elements, invoke JavaScript, and so on. It's much more than an HTML parser alone. It's a real "GUI-less web browser" and HTML unit-testing tool.
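
For a flavour of that API, here's a minimal sketch, assuming a reasonably recent HtmlUnit in which WebClient is AutoCloseable:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Load a page the way a browser would: the HTML is parsed and its JavaScript executed.
try (WebClient webClient = new WebClient()) {
    HtmlPage page = webClient.getPage("http://stackoverflow.com/questions/3152138");
    System.out.println(page.getTitleText()); // the page title, as a browser would show it
}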

Jsoup

Jsoup also provides a completely custom API. It gives you the possibility to select elements using jQuery-like CSS selectors and provides a slick API to traverse the HTML DOM tree to get the elements of interest.

The traversal of the HTML DOM tree in particular is Jsoup's major strength. Anyone who has worked with org.w3c.dom.Document knows what a pain it is to traverse the DOM using the verbose NodeList and Node APIs. True, XPath makes life easier, but still, it's another learning curve and it can end up being verbose anyway.

Here's an example which uses a "plain" W3C DOM parser like JTidy in combination with XPath to extract the first paragraph of your question and the names of all answerers (I am using XPath since without it, the code needed to gather the information of interest would grow ten times as large without writing utility/helper methods).

String url = "http://stackoverflow.com/questions/3152138";
Document document = new Tidy().parseDOM(new URL(url).openStream(), null);
XPath xpath = XPathFactory.newInstance().newXPath();

Node question = (Node) xpath.compile("//*[@id='question']//*[contains(@class,'post-text')]//p[1]").evaluate(document, XPathConstants.NODE);
System.out.println("Question: " + question.getFirstChild().getNodeValue());

NodeList answerers = (NodeList) xpath.compile("//*[@id='answers']//*[contains(@class,'user-details')]//a[1]").evaluate(document, XPathConstants.NODESET);
for (int i = 0; i < answerers.getLength(); i++) {
    System.out.println("Answerer: " + answerers.item(i).getFirstChild().getNodeValue());
}

And here's an example of how to do exactly the same with Jsoup:

String url = "http://stackoverflow.com/questions/3152138";
Document document = Jsoup.connect(url).get();

Element question = document.select("#question .post-text p").first();
System.out.println("Question: " + question.text());

Elements answerers = document.select("#answers .user-details a");
for (Element answerer : answerers) {
    System.out.println("Answerer: " + answerer.text());
}

Do you see the difference? It's not only less code; Jsoup is also relatively easy to grasp if you already have moderate experience with CSS selectors (e.g. from developing websites and/or using jQuery).
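
To give a flavour of that selector support, here are a couple more forms Jsoup understands; the selectors themselves are illustrative and run against the document parsed above:

// A couple more illustrative selector forms, run against the document parsed above.
Elements links = document.select("#question a[href]"); // anchors with an href inside the question
for (Element link : links) {
    System.out.println(link.attr("abs:href") + " -> " + link.text()); // abs:href resolves relative URLs
}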

Summary

The pros and cons of each should be clear enough now. If you just want to use the standard JAXP API to traverse the document, then go for the first-mentioned group of parsers. There are quite a lot of them. Which one to choose depends on the features it provides (how easy does it make HTML cleaning for you? are there listeners/interceptors and tag-specific cleaners?) and the robustness of the library (how often is it updated/maintained/fixed?). If you want to unit test the HTML, then HtmlUnit is the way to go. If you want to extract specific data from the HTML (which is more often than not the real-world requirement), then Jsoup is the way to go.
