Extracting text with Jsoup
Question
I am trying to get information from the following page: http://fantasynews.cbssports.com/fantasyfootball/players/updates/187741
I need to get a separate string for each of these items:
- News title
- News
- Analysis
Right now I am able to get the text of the whole table using:
doc = Jsoup.connect("http://fantasynews.cbssports.com/fantasyfootball/players/updates/" + playerId).timeout(30000).get();
Element title = doc.select("[id*=newsPage1]").first();
But the result is the text of all the articles run together.
Can anyone advise?
Thanks, Josh
Recommended answer
You need to use more elaborate CSS selectors, and then split the text of each entry. Maybe something like:
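import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class PlayerUpdates {
    public static void main(String[] args) {
        // Groups: 1 = title, 2 = news, 3 = analysis.
        // \p{Zs} matches any space separator, including a non-breaking space.
        Pattern pat = Pattern.compile("(.*)News:\\p{Zs}(.*)Analysis:\\p{Zs}(.*)");
        Document doc = null;
        try {
            doc = Jsoup.connect("http://fantasynews.cbssports.com/fantasyfootball/players/updates/187741").userAgent("Mozilla").get();
        } catch (IOException e1) {
            e1.printStackTrace();
            System.exit(1); // non-zero status signals the failure
        }
        // One h3 heading per article inside the table
        Elements titles = doc.select("table h3");
        for (Element title : titles) {
            // The h3's parent (presumably the table cell) holds title, news and analysis
            Element td = title.parent();
            String innerTxt = td.text();
            Matcher mat = pat.matcher(innerTxt);
            if (mat.find()) {
                System.out.println("title = " + mat.group(1));
                System.out.println("news = " + mat.group(2));
                System.out.println("analysis = " + mat.group(3));
            }
        }
    }
}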
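If you want to try the regular expression without fetching the page, here is a minimal sketch that applies the same pattern to a plain string. It assumes the cell text has the shape "title News: ... Analysis: ..."; the class name UpdateSplitter, the method split, and the sample text are made up for illustration and do not come from the page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UpdateSplitter {
    // Same pattern as in the answer: group 1 = title, 2 = news, 3 = analysis.
    private static final Pattern UPDATE =
            Pattern.compile("(.*)News:\\p{Zs}(.*)Analysis:\\p{Zs}(.*)");

    // Returns {title, news, analysis}, or null if the text does not match.
    static String[] split(String cellText) {
        Matcher m = UPDATE.matcher(cellText);
        if (!m.find()) {
            return null;
        }
        return new String[] {
            m.group(1).trim(), m.group(2).trim(), m.group(3).trim()
        };
    }

    public static void main(String[] args) {
        // Hypothetical sample text, standing in for td.text()
        String sample = "Player questionable for Sunday "
                + "News: He missed practice on Friday. "
                + "Analysis: Have a backup ready.";
        String[] parts = split(sample);
        if (parts != null) {
            System.out.println("title = " + parts[0]);
            System.out.println("news = " + parts[1]);
            System.out.println("analysis = " + parts[2]);
        }
    }
}
```

Because the first group is greedy, a title that itself contains "News: " would be split at the last occurrence, so check the output against a few real entries.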
I suggest you look into CSS selectors and the jsoup documentation.