How to create a jsoup selector for the following

Question

For example, I want to extract the text from this article's HTML:

    <div class="description">
            <div style="clear: none;" class="post-fb-like">
              <fb:like class=" fb_edge_widget_with_comment fb_iframe_widget" href="http://mashable.com/2011/08/07/3-handy-mobile-apps/" send="true" width="625" height="61"><span><iframe src="http://www.facebook.com/plugins/like.php?api_key=116628718381794&amp;channel_url=http%3A%2F%2Fstatic.ak.fbcdn.net%2Fconnect%2Fxd_proxy.php%3Fversion%3D3%23cb%3Df138585052991e8%26origin%3Dhttp%253A%252F%252Fmashable.com%252Ff15a8eb75cc2b58%26relation%3Dparent.parent%26transport%3Dpostmessage&amp;href=http%3A%2F%2Fmashable.com%2F2011%2F08%2F07%2F3-handy-mobile-apps%2F&amp;layout=standard&amp;locale=en_US&amp;node_type=link&amp;sdk=joey&amp;send=true&amp;show_faces=true&amp;width=625" class="fb_ltr" title="Like this content on Facebook." style="border: medium none; overflow: hidden; height: 29px; width: 625px;" name="f2d40595a65cf36" id="f24fece5e565ec4" scrolling="no"></iframe></span></fb:like>
            </div>
                        <p><img src="http://ec.mashable.com/wp-content/uploads/2009/01/bizspark2.gif" alt="" align="left"><em>The <a href="http://mashable.com/tag/bizspark">Spark of Genius Series</a> highlights a unique feature of startups and is made possible by <a rel="nofollow" href="http://www.microsoftstartupzone.com/BizSpark/Pages/At_a_Glance.aspx?WT.mc_id=MSZ_Mashable_posts" target="_blank">Microsoft BizSpark</a>. If you would like to have your startup considered for inclusion, please see the details <a href="http://mashable.com/bizspark/">here</a>.</em></p>

<p><img src="http://5.mshcdn.com/wp-content/uploads/2011/08/mobile-devices.jpg" alt="" title="mobile devices" class="alignright" height="141" width="225">Each <a href="http://mashable.com/follow/topics/startup-weekend-roundup">weekend</a>, <em>Mashable</em> hand-picks startups we think are building interesting, unique or niche products. </p>
<p>This week, we’ve rounded up startups making mobile applications that bridge the physical and digital worlds for improved communication and enhanced experiences. </p>
<p>TransFire breaks down global communication barriers with its instant and automatic translation capabilities, while Babbleville facilitates neighbor-to-neighbor communication around events or topics. And, Picdish uses time and place to bring friends together over shared mobile food experiences.</p>
<hr>

I also have another HTML page that I want to extract text from, but it is in a different format. I want to extract the text from http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2

How would I go about creating a selector to extract the text no matter which article URL is given?

Recommended answer

"How would I go about creating a selector to extract the text no matter which article URL is given?"

You can't. Every website has its own HTML structure. Open the page in a web browser, right-click, choose View Source, and look at the markup. You need to create a separate selector for each individual website.
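One way to organize "a separate selector for each website" is a small lookup table keyed by host name. The following is only a minimal sketch under stated assumptions: the class name `SiteSelectors` and the method `selectorFor` are hypothetical, the CNN selector is the one given in this answer, and the Mashable selector `div.description p` is a guess based on the HTML fragment in the question. It uses only the JDK, so the result would still have to be passed to jsoup's `select`.

```java
import java.net.URI;
import java.util.Map;
import java.util.Optional;

public class SiteSelectors {
    // Hypothetical host-to-selector table. The CNN selector comes from this answer;
    // the Mashable one is an assumption based on the question's HTML fragment.
    private static final Map<String, String> SELECTORS = Map.of(
            "www.cnn.com", ".cnn_strycntntlft p",
            "mashable.com", "div.description p");

    // Returns the CSS selector registered for the URL's host, or empty if none is known.
    public static Optional<String> selectorFor(String articleUrl) {
        String host = URI.create(articleUrl).getHost();
        // Map.of maps reject null keys on lookup, so guard against a missing host.
        return host == null ? Optional.empty() : Optional.ofNullable(SELECTORS.get(host));
    }

    public static void main(String[] args) {
        String url = "http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2";
        System.out.println(selectorFor(url).orElse("no selector registered for this site"));
    }
}
```

A site that is not in the table yields an empty `Optional`, which makes the "unknown site" case explicit instead of silently falling back to a selector that may match the wrong content.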

For your first example, assuming that it is the whole HTML, the text is inside those <p> tags. You can then use:

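import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

Document html = Jsoup.parse(yourHtmlString);  // parse the HTML string into a document
Elements paragraphs = html.select("p");       // select every <p> element
String text = paragraphs.text();              // combined text of all matched elements
// ...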

For the CNN page, according to its HTML source you want all the <p> elements inside <div class="cnn_strycntntlft">, so this selector should do:

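// Fetch and parse the page over HTTP; get() throws IOException if the request fails.
Document document = Jsoup.connect("http://www.cnn.com/2011/WORLD/europe/08/12/uk.riots.dan.rivers/index.html?hpt=hp_c2").get();
// Descendant selector: every <p> inside an element with class "cnn_strycntntlft".
Elements paragraphs = document.select(".cnn_strycntntlft p");
String text = paragraphs.text();
// ...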

By the way, it would be easier to use the sites' RSS feeds instead of parsing the whole HTML. Many news sites provide RSS feeds for exactly this purpose.
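To illustrate why a feed is easier to consume than a full page, here is a rough sketch that reads the item titles from an RSS 2.0 document. It is not part of the original answer: it uses the JDK's built-in XML parser rather than jsoup, the class name `RssTitles` and the method `itemTitles` are hypothetical, and the sample feed is made up. A real feed would be downloaded from the site's published feed URL.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RssTitles {
    // Returns the <title> text of every <item> in an RSS 2.0 document.
    public static List<String> itemTitles(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations, since feeds are untrusted input.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("item");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // Look for <title> inside the item only, so the channel title is skipped.
                NodeList titleNodes = item.getElementsByTagName("title");
                if (titleNodes.getLength() > 0) {
                    titles.add(titleNodes.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse RSS document", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample feed for illustration only.
        String sample = "<rss version=\"2.0\"><channel><title>Example feed</title>"
                + "<item><title>First story</title><description>Summary one</description></item>"
                + "<item><title>Second story</title><description>Summary two</description></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(sample)); // prints [First story, Second story]
    }
}
```

Because RSS uses the same element names on every site, this one routine works across feeds, whereas the HTML approach needs a different selector per site.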
