Which HTML Parser is the best?
I write a lot of parsers. Until now, I have been using the HtmlUnit headless browser for both parsing and browser automation.
Now I want to separate the two tasks.
Since 80% of my work involves only parsing, I want to use a lightweight HTML parser, because HtmlUnit takes a long time to first load a page, then get the source, and then parse it.
I want to know which HTML parser is the best. Ideally, it would behave much like the HtmlUnit parser.
EDIT:
By "best", I mean it should have at least the following features:
- Speed
- Easy lookup of any HtmlElement by its "id", "name", or "tag type".
It is fine if it doesn't clean up dirty HTML. I don't need to clean any HTML source; I just need the easiest way to move across HtmlElements and harvest data from them.
Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.:
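import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a"); // every <a> element (none in this sample)
Element head = doc.select("head").first(); // first match, or null if there is none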
See the Selector javadoc for more info.
This is a new project, so any ideas for improvement are very welcome!
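As a rough sketch of how the lookups asked for in the question (by "id", by "name", and by tag type) might look with jsoup: the sample markup and the class name `JsoupLookupSketch` below are made up for illustration, and the example assumes the jsoup library is on the classpath. It is one possible way to do these lookups, not necessarily the recommended one.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupLookupSketch {
    // Hypothetical sample markup, invented for this illustration.
    static final String HTML =
        "<html><body>"
        + "<form id=\"search\"><input name=\"q\" value=\"jsoup\"></form>"
        + "<p>First</p><p>Second</p>"
        + "</body></html>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);

        // By id: returns the matching element, or null if there is none.
        Element form = doc.getElementById("search");
        System.out.println(form.tagName());

        // By name attribute, using a CSS attribute selector.
        Element input = doc.select("[name=q]").first();
        System.out.println(input.attr("value"));

        // By tag type: collects every <p> element in document order.
        Elements paragraphs = doc.getElementsByTag("p");
        for (Element p : paragraphs) {
            System.out.println(p.text());
        }
    }
}
```

The same three lookups could also be written purely as selectors (`#search`, `[name=q]`, `p`) passed to `doc.select(...)`, which keeps the lookup style uniform.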