How to create a web crawler in Java?
Question
Hi, I want to create a web crawler in Java that retrieves some data from a web page, like the title and description, and stores that data in a database.
Answer
If you want to roll your own, you can use the HttpClient included in the Android API (Apache HttpClient).
Example usage of HttpClient (you only need to parse out the fields you want, such as the title and description):
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class HttpTest {

    public static void main(String... args)
            throws ClientProtocolException, IOException {
        crawlPage("http://www.google.com/");
    }

    // Remember visited URLs so the same page is never crawled twice.
    static Set<String> checked = new HashSet<String>();

    private static void crawlPage(String url)
            throws ClientProtocolException, IOException {
        if (checked.contains(url))
            return;
        checked.add(url);
        System.out.println("Crawling: " + url);

        HttpClient client = new DefaultHttpClient();
        HttpGet request = new HttpGet(url); // use the parameter, not a hardcoded URL
        HttpResponse response = client.execute(request);
        Reader reader = null;
        try {
            reader = new InputStreamReader(response.getEntity().getContent());
            Links links = new Links();
            new ParserDelegator().parse(reader, links, true);
            for (String link : links.list)
                if (link.startsWith("http://"))
                    crawlPage(link);
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    // Collects the href of every <a> tag the parser encounters.
    static class Links extends HTMLEditorKit.ParserCallback {
        List<String> list = new LinkedList<String>();

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t == HTML.Tag.A) {
                Object href = a.getAttribute(HTML.Attribute.HREF);
                if (href != null) // anchors without an href would cause an NPE
                    list.add(href.toString());
            }
        }
    }
}
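Since the question asks for the page title and description specifically, here is a minimal sketch of how those could be extracted with the same `HTMLEditorKit.ParserCallback` approach used above. The class name `PageMeta` is my own; the `<title>` text arrives via `handleText` between the title's start and end tags, while `<meta>` is an empty element and is reported through `handleSimpleTag`:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class PageMeta extends HTMLEditorKit.ParserCallback {
    String title = "";
    String description = "";
    private boolean inTitle = false;

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.TITLE)
            inTitle = true;
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.TITLE)
            inTitle = false;
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Text reported while inside <title>...</title> is the page title.
        if (inTitle)
            title = new String(data);
    }

    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // <meta name="description" content="..."> is an empty tag,
        // so it is delivered here rather than to handleStartTag.
        if (t == HTML.Tag.META) {
            Object name = a.getAttribute(HTML.Attribute.NAME);
            Object content = a.getAttribute(HTML.Attribute.CONTENT);
            if (name != null && "description".equalsIgnoreCase(name.toString())
                    && content != null) {
                description = content.toString();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Hello</title>"
                + "<meta name=\"description\" content=\"A test page\">"
                + "</head><body>body text</body></html>";
        PageMeta meta = new PageMeta();
        Reader reader = new StringReader(html);
        new ParserDelegator().parse(reader, meta, true);
        System.out.println(meta.title);
        System.out.println(meta.description);
    }
}
```

Inside the crawler above, you would pass a `PageMeta` instance to `ParserDelegator.parse` alongside (or instead of) `Links`, then write `meta.title` and `meta.description` to your database.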