How can I extract specific links in Wikipedia articles using jsoup?
Question
I am doing an NLP project and I need to know how to extract only the links that are in the "introduction" section and in the "geography" section of this Wikipedia page: http://en.wikipedia.org/wiki/Boston
Can you help me?
Recommended answer
Wikipedia does not make this easy. I don't claim this to be elegant or even very reusable.
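import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// The class name is only a wrapper added so the snippet compiles on its own.
public class BostonLinks {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").timeout(5000).get();

        // Introduction: start at the first <p> and follow consecutive <p> siblings.
        Element intro = doc.body().select("p").first();
        // The null check avoids a NullPointerException when the siblings run out.
        while (intro != null && intro.tagName().equals("p")) {
            // here you will get an Elements object which you can
            // iterate through to get the links in the intro
            System.out.println(intro.select("a"));
            intro = intro.nextElementSibling();
        }

        // Geography: find the matching <h2>, then walk its siblings up to the next <h2>.
        for (Element h2 : doc.body().select("h2")) {
            // This check depends on the page's heading markup (two <span>s inside
            // the <h2>, the second holding the title) and may need adjusting.
            if (h2.select("span").size() == 2
                    && h2.select("span").get(1).text().equals("Geography")) {
                Element nextsib = h2.nextElementSibling();
                while (nextsib != null && !nextsib.tagName().equals("h2")) {
                    if (nextsib.tagName().equals("p")) {
                        // here you will get an Elements object which you can
                        // iterate through to get the links in the geography section
                        System.out.println(nextsib.select("a"));
                    }
                    nextsib = nextsib.nextElementSibling();
                }
            }
        }
    }
}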
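If you need the link targets as strings rather than the printed elements, the same sibling walk can be wrapped in a small helper. The following is only a minimal sketch: the class name SectionLinks, the method linksInSection, and the inline HTML are made up for illustration, and it assumes the heading and its paragraphs are siblings, as the code above does. It uses jsoup's `h2:contains(...)` selector to locate the heading and `absUrl("href")` to resolve relative links against the base URI.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SectionLinks {

    // Hypothetical helper (not part of the original answer): returns the
    // absolute URLs of all links in the <p> elements between the first <h2>
    // whose text contains headingText and the next <h2>.
    // Assumption: the heading and the paragraphs are sibling elements.
    static List<String> linksInSection(Document doc, String headingText) {
        List<String> urls = new ArrayList<>();
        Element heading = doc.select("h2:contains(" + headingText + ")").first();
        if (heading == null) {
            return urls; // no such section on this page
        }
        for (Element sib = heading.nextElementSibling();
                sib != null && !sib.tagName().equals("h2");
                sib = sib.nextElementSibling()) {
            if (sib.tagName().equals("p")) {
                for (Element a : sib.select("a[href]")) {
                    // absUrl resolves the href against the document's base URI
                    urls.add(a.absUrl("href"));
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up HTML standing in for a downloaded page.
        String html = "<h2>Geography</h2>"
                + "<p>Near <a href=\"/wiki/Massachusetts_Bay\">the bay</a>.</p>"
                + "<h2>History</h2>"
                + "<p><a href=\"/wiki/Puritans\">Puritans</a></p>";
        Document doc = Jsoup.parse(html, "http://en.wikipedia.org/");
        System.out.println(linksInSection(doc, "Geography"));
    }
}
```

Parsing an inline string keeps the sketch independent of the live page. With a real page you would pass in the Document returned by Jsoup.connect(...).get(), and the heading selector may need adapting to whatever markup the page actually uses.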