How to easily parse HTML for consumption as a service using Java?

Views: 131 Published: 2018/6/25 14:10:36 Tags: java html html-parsing web-scraping

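Question

I want to parse an HTML page such as http://www.reddit.com/r/reddit.com/search?q=Microsoft&sort=top and extract only the text of the <a> elements that have class="title".

The options I have looked at so far all seem like overkill (SAX, DOM traversal).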

Solution

Use jsoup. It supports jQuery-like CSS selectors. Here's a kickoff example:

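// Requires the jsoup library (org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element).
String url = "http://www.reddit.com/r/reddit.com/search?q=Microsoft&sort=top";
Document document = Jsoup.connect(url).get();
for (Element link : document.select("a.title")) {
    System.out.println(link.absUrl("href")); // absolute URL of each title link
}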

Result:

http://news.cnet.com/8301-13579_3-10288022-37.html
http://dl.getdropbox.com/u/18264/mspoland.jpg
http://www.reddit.com/r/reddit.com/comments/ar5z1/verizon_stealthily_installed_a_bing_search_app_on/
http://www.grabup.com/uploads/240ccede5360b093dbf298f8946025a5.png
http://www.youtube.com/watch?v=7Ym0tZSWGMc&fmt=34
http://i42.tinypic.com/wv5qar.jpg
http://www.reddit.com/r/technology/comments/8hnya/apple_no_i_dont_want_to_make_quicktime_my_default/
http://cssferret.imgur.com/microsoft_wtf
http://imgur.com/8pct5.png
http://googleblog.blogspot.com/2011/02/microsofts-bing-uses-google-search.html
http://news.cnet.com/8301-27076_3-20011994-248.html?part=rss&subj=news&tag=2547-1_3-0-20
http://gizmodo.com/5383413/shady-microsoft-plugin-pokes-critical-hole-in-firefox-security
http://i.stack.imgur.com/sl1LY.png
http://imgur.com/T6BMs
http://www.nytimes.com/2010/09/14/world/europe/14raid.html
http://twitter.com/phil_nash/status/21159419598
http://online.wsj.com/article/SB10001424052748704415104576065641376054226.html?mod=WSJASIA_hpp_MIDDLESecondNews
http://www.reddit.com/r/reddit.com/comments/bqqxv/inside_the_chinese_factory_that_makes_microsofts/
http://i.min.us/iX0PA.png
http://imgur.com/m4nuz.gif
http://www.gamesforwindows.com/en-CA/Games/AgeofEmpiresIII/
http://foredecker.wordpress.com/2011/02/27/working-at-microsoft-day-to-day-coding/
http://homepage.mac.com/aleksivic/.Pictures/humor/spotTheBusey.jpg
http://www.bloomberg.com/apps/news?pid=20601087&sid=a7uOT0ro100U&refer=home
http://www.microsoft.com/windowsxp/eula/pro.mspx

Pretty concise, huh?
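The question asks for the text of each title link, whereas the example prints the link targets. Swapping absUrl("href") for text() should cover that case. Below is a minimal sketch that runs offline by parsing an inline HTML string with Jsoup.parse. The class name TitleTextExample, the helper titleTexts, the sample markup and the base URI http://example.com/ are all made up for illustration and do not come from the real reddit page.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TitleTextExample {

    // Hypothetical helper: returns the visible text of every <a class="title"> element.
    public static List<String> titleTexts(String html, String baseUri) {
        // The base URI is only needed if relative links are later resolved with absUrl().
        Document document = Jsoup.parse(html, baseUri);
        List<String> titles = new ArrayList<>();
        for (Element link : document.select("a.title")) {
            titles.add(link.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for a fetched page.
        String html = "<div>"
                + "<a class=\"title\" href=\"/first\">First post</a>"
                + "<a class=\"comments\" href=\"/first/comments\">12 comments</a>"
                + "<a class=\"title\" href=\"/second\">Second post</a>"
                + "</div>";
        for (String title : titleTexts(html, "http://example.com/")) {
            System.out.println(title);
        }
    }
}
```

With the sample markup above, only the two links carrying class="title" are selected, so the program prints "First post" and "Second post". To run it against a live page, the Document could instead come from Jsoup.connect(url).get() as in the example above.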

See also:

Pros and cons of the leading HTML parsers in Java
