How to easily parse HTML for consumption as a service using Java?

Question

I want to parse an HTML page such as http://www.reddit.com/r/reddit.com/search?q=Microsoft&sort=top
and extract only the text of the elements that match <a class="title">.

The options I have looked at so far (SAX, DOM traversal) all look like overkill.

Solution

Use Jsoup. It supports jQuery-like CSS selectors. Here's a kickoff example:

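import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String url = "http://www.reddit.com/r/reddit.com/search?q=Microsoft&sort=top";
Document document = Jsoup.connect(url).get(); // fetch the page and parse it into a DOM
for (Element link : document.select("a.title")) { // every <a> element with class "title"
    System.out.println(link.absUrl("href")); // print the link target as an absolute URL
}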

Result:

http://news.cnet.com/8301-13579_3-10288022-37.html
http://dl.getdropbox.com/u/18264/mspoland.jpg
http://www.reddit.com/r/reddit.com/comments/ar5z1/verizon_stealthily_installed_a_bing_search_app_on/
http://www.grabup.com/uploads/240ccede5360b093dbf298f8946025a5.png
http://www.youtube.com/watch?v=7Ym0tZSWGMc&fmt=34
http://i42.tinypic.com/wv5qar.jpg
http://www.reddit.com/r/technology/comments/8hnya/apple_no_i_dont_want_to_make_quicktime_my_default/
http://cssferret.imgur.com/microsoft_wtf
http://imgur.com/8pct5.png
http://googleblog.blogspot.com/2011/02/microsofts-bing-uses-google-search.html
http://news.cnet.com/8301-27076_3-20011994-248.html?part=rss&subj=news&tag=2547-1_3-0-20
http://gizmodo.com/5383413/shady-microsoft-plugin-pokes-critical-hole-in-firefox-security
http://i.stack.imgur.com/sl1LY.png
http://imgur.com/T6BMs
http://www.nytimes.com/2010/09/14/world/europe/14raid.html
http://twitter.com/phil_nash/status/21159419598
http://online.wsj.com/article/SB10001424052748704415104576065641376054226.html?mod=WSJASIA_hpp_MIDDLESecondNews
http://www.reddit.com/r/reddit.com/comments/bqqxv/inside_the_chinese_factory_that_makes_microsofts/
http://i.min.us/iX0PA.png
http://imgur.com/m4nuz.gif
http://www.gamesforwindows.com/en-CA/Games/AgeofEmpiresIII/
http://foredecker.wordpress.com/2011/02/27/working-at-microsoft-day-to-day-coding/
http://homepage.mac.com/aleksivic/.Pictures/humor/spotTheBusey.jpg
http://www.bloomberg.com/apps/news?pid=20601087&sid=a7uOT0ro100U&refer=home
http://www.microsoft.com/windowsxp/eula/pro.mspx

Pretty concise, huh?
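
The example above prints each link's href, while the question asks for the text of the elements. A minimal sketch of that variant follows, assuming Element.text() returns what you need. The class name TitleExtractor, the helper extractTitles and the inline HTML are hypothetical and only there so the snippet runs without a network call; with a live page you would pass the Document returned by Jsoup.connect(url).get() instead.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class TitleExtractor {

    /** Returns the visible text of every <a class="title"> in the document. */
    public static List<String> extractTitles(Document document) {
        List<String> titles = new ArrayList<>();
        for (Element link : document.select("a.title")) {
            titles.add(link.text());
        }
        return titles;
    }

    public static void main(String[] args) {
        // Hypothetical markup standing in for the fetched page.
        String html = "<div>"
                + "<a class=\"title\" href=\"/first\">First post</a>"
                + "<a class=\"other\" href=\"/skip\">Not a title</a>"
                + "<a class=\"title\" href=\"http://example.org/second\">Second post</a>"
                + "</div>";

        // The second argument is the base URI used to resolve relative hrefs.
        Document document = Jsoup.parse(html, "http://example.com/");

        for (String title : extractTitles(document)) {
            System.out.println(title);
        }
    }
}
```

Running main prints "First post" and "Second post"; the link with class "other" is skipped because it does not match the a.title selector.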
