Which HTML Parser is the best?

Views: 65  Published: 2018/6/13 9:31:08  Tags: java, html, parsing, html-parsing, web-scraping

Question

I write a lot of parsers. Up until now, I have been using the HtmlUnit headless browser for both parsing and browser automation.

Now, I want to separate the two tasks.

As 80% of my work involves just parsing, I want to use a lightweight HTML parser, because HtmlUnit takes a long time to first load a page, then get the source, and then parse it.

I want to know which HTML parser is the best. Ideally, it would be close to the HtmlUnit parser.

EDIT:

By "best", I mean at least the following features:

Speed

Ease of locating any HtmlElement by its "id", "name", or "tag type".

It would be OK for me if it doesn't clean up dirty HTML code. I don't need to clean any HTML source. I just need the easiest way to move across HtmlElements and harvest data from them.

Solution

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:
    String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html);
    Elements links = doc.select("a");
    Element head = doc.select("head").first();
See the Selector javadoc for more info.
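
To tie this back to the three lookups the question asks for (by id, by name, by tag type), here is a minimal sketch, assuming jsoup is on the classpath. The class name `SelectorSketch`, the helper method names, and the sample markup are hypothetical and exist only for illustration; only the `Jsoup.parse`, `getElementById`, `select`, `getElementsByTag`, `first`, `text`, and `attr` calls come from jsoup itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorSketch {
    // Hypothetical sample markup, made up for this illustration.
    static final String SAMPLE_HTML =
        "<html><head><title>First parse</title></head><body>"
        + "<p id=\"intro\">Parsed HTML into a doc.</p>"
        + "<form><input name=\"q\" value=\"jsoup\"></form>"
        + "<a href=\"/one\">One</a><a href=\"/two\">Two</a>"
        + "</body></html>";

    // Lookup by id: returns the element's text, or null if no such id exists.
    static String textById(Document doc, String id) {
        Element el = doc.getElementById(id);
        return el == null ? null : el.text();
    }

    // Lookup by name attribute via a CSS attribute selector.
    // Returns the "value" attribute of the first match, or null if none.
    static String valueByName(Document doc, String name) {
        Element el = doc.select("[name=" + name + "]").first();
        return el == null ? null : el.attr("value");
    }

    // Lookup by tag type: counts all elements with the given tag name.
    static int countByTag(Document doc, String tag) {
        return doc.getElementsByTag(tag).size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        System.out.println(textById(doc, "intro"));
        System.out.println(valueByName(doc, "q"));
        System.out.println(countByTag(doc, "a"));
    }
}
```

The same lookups could also be written purely as selectors (`#intro`, `[name=q]`, `a`); the dedicated methods are used here only to show that both styles are available.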

This is a new project, so any ideas for improvement are very welcome!
