How can I efficiently parse HTML with Java?
Question
I do a lot of HTML parsing in my line of work. Up until now, I have been using the HtmlUnit headless browser for both parsing and browser automation.
Now I want to separate the two tasks.
I want to use a lightweight HTML parser, because with HtmlUnit it takes a long time to first load a page, then get the source, and then parse it.
I want to know which HTML parser can parse HTML efficiently. I need:
- Speed
- Ease of locating any HtmlElement by its "id", "name", or "tag type"
It is fine with me if the parser doesn't clean up dirty HTML. I don't need to clean any HTML source. I just need the easiest way to move across HtmlElements and harvest data from them.
Recommended answer
Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.:
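import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a"); // all <a> elements (none in this sample)
Element head = doc.select("head").first(); // first matching <head> element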
See the Selector javadoc for more info.
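To connect this to the three lookups asked about in the question (by id, by name, by tag type), here is a minimal sketch. It assumes a current jsoup release on the classpath. The class name LocateElements, its helper methods, and the sample markup are made up for illustration and are not part of jsoup. The document is parsed once and then queried several times, which avoids re-parsing for every lookup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class for illustration only; not part of jsoup.
public class LocateElements {

    // Sample markup (an assumption, chosen to exercise all three lookups).
    static final String SAMPLE_HTML =
        "<html><head><title>First parse</title></head><body>"
        + "<p id=\"intro\">Parsed HTML into a doc.</p>"
        + "<p>Second paragraph.</p>"
        + "<input name=\"q\" value=\"jsoup\">"
        + "</body></html>";

    // Look up a single element by its id; returns its text, or null if absent.
    static String textById(Document doc, String id) {
        Element el = doc.getElementById(id);
        return el == null ? null : el.text();
    }

    // Look up elements by their name attribute; returns the "value" attribute
    // of the first match, or an empty string if nothing matches.
    static String valueByName(Document doc, String name) {
        return doc.getElementsByAttributeValue("name", name).attr("value");
    }

    // Count the elements with the given tag name.
    static int countByTag(Document doc, String tag) {
        return doc.getElementsByTag(tag).size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML); // parse once, query many times
        System.out.println(textById(doc, "intro"));
        System.out.println(valueByName(doc, "q"));
        System.out.println(countByTag(doc, "p"));
    }
}
```

The same lookups can also be written with the CSS selector syntax, for example doc.select("#intro"), doc.select("[name=q]"), and doc.select("p").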
This is a new project, so any ideas for improvement are very welcome!