Java Web Crawler for retrieving Google search results

Question

This question has been asked many times before. However, some of the APIs have changed over time, and I would like to know a good way to implement this now.

The best way to do this would be to use the Google search API. However, https://developers.google.com/custom-search/json-api/v1/overview says there are only 100 free search queries per day. I will need more than that, and I don't want to pay for it.
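For reference, a request to the Custom Search JSON API mentioned above is a plain HTTPS GET with the query passed as URL parameters. The following is a minimal sketch of building such a URL, assuming you already have an API key and a search engine ID; the class name CustomSearchUrl and the placeholder values are hypothetical, and each request built this way would count against the daily quota.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the original post): builds a request URL
// for the Custom Search JSON API from its key, cx, q and num parameters.
final class CustomSearchUrl {

    private static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    static String build(String apiKey, String engineId, String query, int num) {
        // Encode every user-supplied value so spaces and symbols stay valid in the URL.
        return ENDPOINT
                + "?key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8)
                + "&cx=" + URLEncoder.encode(engineId, StandardCharsets.UTF_8)
                + "&q=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&num=" + num;
    }

    public static void main(String[] args) {
        // YOUR_API_KEY and YOUR_ENGINE_ID are placeholders, not real credentials.
        System.out.println(build("YOUR_API_KEY", "YOUR_ENGINE_ID",
                "Natural Language Processing", 10));
    }
}
```

The sketch only assembles the URL; sending the request and parsing the JSON response are left out. It uses the URLEncoder.encode(String, Charset) overload, which requires Java 10 or later.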

I tried simple REST APIs, but the response is mostly JavaScript code and I can't find what I need in it.

I also tried libraries such as http://jsoup.org/, but even its response doesn't contain the information I need.

Recommended answer

I tried using Jsoup and it worked, although the first few results include some undesired characters. Below is my code:

package crawl_google;

import java.net.URLEncoder;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GoogleResults {

    public static void main(String[] args) throws Exception {
        // Pass the search query and the number of results as parameters.
        printGoogleResults("Natural Language Processing", 10);
    }

    public static void printGoogleResults(String keyword, int numResults) throws Exception {
        // URL-encode the keyword: spaces become '+', and characters such as '&' are escaped.
        String query = URLEncoder.encode(keyword, "UTF-8");
        String url = "https://www.google.com/search?q=" + query + "&num=" + numResults;

        // Connect to the URL and obtain the HTML response.
        Document doc = Jsoup
                .connect(url)
                .userAgent("Mozilla")
                .timeout(5000)
                .get();

        // Select the result blocks. "li.g" was found by examining the DOM when this
        // was written; re-check it against the current markup if nothing is printed.
        Elements els = doc.select("li.g");
        for (Element el : els) {
            // Print title, site and abstract. The abstract joins the text of every
            // <span> inside the result, so it may contain more than the snippet.
            System.out.println("Title : " + el.getElementsByTag("h3").text());
            System.out.println("Site : " + el.getElementsByTag("cite").text());
            System.out.println("Abstract : " + el.getElementsByTag("span").text() + "\n");
        }
    }
}
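
One fragile step in a scraper like this is building the query string. Replacing spaces with + by hand leaves characters such as & or a literal + in the keyword unescaped, so part of the keyword can be read as a separate URL parameter. Below is a minimal sketch comparing the two approaches using only the standard library; the class name SearchUrl and its method names are hypothetical and only serve the illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: two ways to build the search URL for a keyword.
final class SearchUrl {

    private static final String BASE = "https://www.google.com/search?q=";

    // Hand-rolled approach: only spaces are handled, everything else is left as-is.
    static String naive(String keyword, int numResults) {
        return BASE + keyword.replace(" ", "+") + "&num=" + numResults;
    }

    // Form-encodes the keyword, so '+', '&' and non-ASCII text are escaped too.
    static String encoded(String keyword, int numResults) {
        return BASE + URLEncoder.encode(keyword, StandardCharsets.UTF_8) + "&num=" + numResults;
    }

    public static void main(String[] args) {
        String keyword = "C++ & Java";
        System.out.println(naive(keyword, 10));   // q=C+++&+Java  (the '&' splits the query)
        System.out.println(encoded(keyword, 10)); // q=C%2B%2B+%26+Java
    }
}
```

For a keyword made only of letters and spaces, such as "Natural Language Processing", both methods produce the same URL; the difference only shows up once the keyword contains reserved characters.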
