Use jsoup to extract a certain part of text

Question

I have a webpage whose source contains several similar structures like the one below:

<tr>
<td width="10%" bgcolor="#FFFFFF"><font class="bodytext9">1-Jun-2013</font></td>
<td width="4%" bgcolor="#FFFFFF" align=center><font class="bodytext9">Sat</font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9">Another Text</font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9"><img src="img/colors/white.gif"></font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="link2">Here is also Text</a></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a href="LINKtoWeb" class=list><u>STRING TO CAPTURE</u></a></td>
<td width="4%" bgcolor="#FFFFFF" align="center"><a target="_new" href="AnotherLink"><img src="img/img2.gif" border="0"></a></td>
</tr>

This structure repeats many times with different text inside, but I only want to extract this row, because it is the first one in which the text "STRING TO CAPTURE" appears. How do I use Jsoup to extract only this row, the visible text inside it, and the URL AnotherLink that appears in the same row as the text "STRING TO CAPTURE"? I am new to Jsoup, so I have only tried this:

Document doc = Jsoup.connect("http://www.website.com").get();

Element link = doc.select("a").first();
String relHref = link.attr("href");
String absHref = link.attr("abs:href");
String text = doc.body().text();
String linkHref = link.attr("href");
String linkText = link.text();

System.out.println("link:" + link);
System.out.println("text:" + text);

but I can't get any further toward this goal. Please give me some advice. Thank you!

Recommended answer

With this test input:

String test = "<html><body><table>";
test += "<tr>";
test += "<td width=\"10%\" bgcolor=\"#FFFFFF\"><font class=\"bodytext9\">1-Jun-2013</font></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=center><font class=\"bodytext9\">Sat</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\">Another Text</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\"><img src=\"img/colors/white.gif\"></font></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a class=\"black_9\" href=\"link2\">Here is also Text</a></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a href=\"LINKtoWeb\" class=list><u>TEXT THAT DOESN'T MATCH</u></a></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=\"center\"><a target=\"_new\" href=\"NotMatchLink\"><img src=\"img/img2.gif\" border=\"0\"></a></td>";
test += "</tr>";
test += "<tr>";
test += "<td width=\"10%\" bgcolor=\"#FFFFFF\"><font class=\"bodytext9\">1-Jun-2013</font></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=center><font class=\"bodytext9\">Sat</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\">Another Text</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\"><img src=\"img/colors/white.gif\"></font></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a class=\"black_9\" href=\"link2\">Here is also Text</a></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a href=\"LINKtoWeb\" class=list><u>STRING TO CAPTURE</u></a></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=\"center\"><a target=\"_new\" href=\"AnotherLink\"><img src=\"img/img2.gif\" border=\"0\"></a></td>";
test += "</tr>";
test += "<tr>";
test += "<td width=\"10%\" bgcolor=\"#FFFFFF\"><font class=\"bodytext9\">1-Jun-2013</font></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=center><font class=\"bodytext9\">Sat</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\">Another Text</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\"><img src=\"img/colors/white.gif\"></font></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a class=\"black_9\" href=\"link2\">Here is also Text</a></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a href=\"LINKtoWeb\" class=list><u>MORE TEXT THAT DOESN'T MATCH</u></a></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=\"center\"><a target=\"_new\" href=\"NotMatchLink\"><img src=\"img/img2.gif\" border=\"0\"></a></td>";
test += "</tr>";
test += "<tr>";
test += "<td width=\"10%\" bgcolor=\"#FFFFFF\"><font class=\"bodytext9\">1-Jun-2013</font></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=center><font class=\"bodytext9\">Sat</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\">Another Text</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\"><img src=\"img/colors/white.gif\"></font></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a class=\"black_9\" href=\"link2\">Here is also Text</a></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a href=\"LINKtoWeb\" class=list><u>STILL MORE TEXT THAT DOESN'T MATCH</u></a></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=\"center\"><a target=\"_new\" href=\"NotMatchLink\"><img src=\"img/img2.gif\" border=\"0\"></a></td>";
test += "</tr>";
test += "<tr>";
test += "<td width=\"10%\" bgcolor=\"#FFFFFF\"><font class=\"bodytext9\">Second 1-Jun-2013</font></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=center><font class=\"bodytext9\">Second Sat</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\">Second Another Text</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\"><img src=\"img/colors/white.gif\"></font></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a class=\"black_9\" href=\"link2\">Second Here is also Text</a></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a href=\"LINKtoWeb\" class=list><u>STRING TO CAPTURE</u></a></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=\"center\"><a target=\"_new\" href=\"SecondAnotherLink\"><img src=\"img/img2.gif\" border=\"0\"></a></td>";
test += "</tr>";
test += "</table></body></html>";

This code:

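// Parse the test HTML (uses org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements).
final Document document = Jsoup.parse(test);
// select() returns matches in document order, so get(0) is the first row containing the text.
final Element entireRow = document.select("tr:contains(STRING TO CAPTURE)").get(0);
// Print the visible text of every cell in that row.
for (final Element column : entireRow.select("td")) {
    System.out.println("Column text is: " + column.text());
}
// The wanted link sits in the cell immediately after the one containing the text.
final Elements link = entireRow.select("td:contains(STRING TO CAPTURE) + td > a[href]");
System.out.println("Target link is: " + link.attr("href"));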

It outputs:

Column text is: 1-Jun-2013
Column text is: Sat
Column text is: 
Column text is: Another Text
Column text is: 
Column text is: Here is also Text
Column text is: STRING TO CAPTURE
Column text is: 
Target link is: AnotherLink
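
The snippet above throws an exception if no row contains the text, and it returns the link exactly as written in the page. As a rough sketch of a more defensive variant (not part of the original answer), the code below walks the rows by hand instead of using the :contains selector, returns null when nothing matches, and resolves the link against a base URI. The class name FirstRowExtractor, the method name linkAfterFirstMatch, and the base URI https://example.com/ are hypothetical names chosen for illustration. Note that this version matches the text case-sensitively, whereas :contains ignores case.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

final class FirstRowExtractor {

    /**
     * Finds the first table cell whose text contains needle and returns the
     * absolute URL of the first link in the cell right after it.
     * Returns null if no cell matches or the following cell has no link.
     */
    static String linkAfterFirstMatch(String html, String baseUri, String needle) {
        // The base URI lets jsoup resolve relative href values later on.
        Document document = Jsoup.parse(html, baseUri);
        // select() returns rows in document order, so the first hit wins.
        for (Element row : document.select("tr")) {
            for (Element cell : row.children()) {
                if (!cell.text().contains(needle)) {
                    continue;
                }
                // Only the first matching cell is considered.
                Element next = cell.nextElementSibling();
                Element link = (next == null) ? null : next.selectFirst("a[href]");
                return (link == null) ? null : link.absUrl("href");
            }
        }
        return null;
    }
}
```

Calling FirstRowExtractor.linkAfterFirstMatch(test, "https://example.com/", "STRING TO CAPTURE") on the test input above should give https://example.com/AnotherLink, since the second row is the first one containing the text. Comparing plain strings also avoids building a selector from the search text, which could break if the text contained parentheses.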
