Using JSoup to scrape Google Results


Problem description

I'm trying to use JSoup to scrape the search results from Google. Currently this is my code.

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class GoogleOptimization {
    public static void main(String[] args) {
        Document doc;
        try {
            // Fetch the results page; timeout(0) means no timeout at all.
            doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=")
                    .userAgent("Mozilla")
                    .ignoreHttpErrors(true)
                    .timeout(0)
                    .get();
            // Placeholder: this is the selector the question is asking about.
            Elements links = doc.select("what should i put here?");
            for (Element link : links) {
                System.out.println("\n" + link.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

I'm just trying to get the titles of the search results and the snippets below each title. I just don't know which elements to look for in order to scrape these. If anyone has a better method to scrape Google using Java, I would love to know.

Thanks.

Recommended answer

Here you go.

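import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ScanWebSO {
    public static void main(String[] args) {
        Document doc;
        try {
            // Fetch the results page; timeout(0) means no timeout at all.
            doc = Jsoup.connect("https://www.google.com/search?as_q=&as_epq=%22Yorkshire+Capital%22+&as_oq=fraud+OR+allegations+OR+scam&as_eq=&as_nlo=&as_nhi=&lr=lang_en&cr=countryCA&as_qdr=all&as_sitesearch=&as_occt=any&safe=images&tbs=&as_filetype=&as_rights=")
                    .userAgent("Mozilla")
                    .ignoreHttpErrors(true)
                    .timeout(0)
                    .get();
            // Each search result is an <li class="g"> listing.
            Elements links = doc.select("li[class=g]");
            for (Element link : links) {
                // The title sits in <h3 class="r"> inside the listing.
                Elements titles = link.select("h3[class=r]");
                String title = titles.text();

                // The snippet sits in <span class="st"> inside the listing.
                Elements bodies = link.select("span[class=st]");
                String body = bodies.text();

                System.out.println("Title: " + title);
                System.out.println("Body: " + body + "\n");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}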
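
To try these selectors without sending a request to Google, one option is to run them against a small HTML string. The sketch below is only an illustration: the class name OfflineSelectorCheck, the helper extractResults, and the sample markup are made up for this example, with the markup modeled on the ol, li, h3, and span elements that the selectors above target. It assumes jsoup is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OfflineSelectorCheck {

    // Hypothetical sample markup, modeled on the elements the selectors target.
    static final String SAMPLE_HTML =
          "<ol eid=\"\" id=\"rso\">"
        + "<li class=\"g\"><h3 class=\"r\"><a href=\"http://example.com/a\">First title</a></h3>"
        + "<span class=\"st\">First snippet</span></li>"
        + "<li class=\"g\"><h3 class=\"r\"><a href=\"http://example.com/b\">Second title</a></h3>"
        + "<span class=\"st\">Second snippet</span></li>"
        + "</ol>";

    // Returns one "title | snippet" string per result listing in the given HTML.
    static List<String> extractResults(String html) {
        Document doc = Jsoup.parse(html);
        List<String> results = new ArrayList<>();
        for (Element listing : doc.select("li[class=g]")) {
            String title = listing.select("h3[class=r]").text();
            String body = listing.select("span[class=st]").text();
            results.add(title + " | " + body);
        }
        return results;
    }

    public static void main(String[] args) {
        for (String result : extractResults(SAMPLE_HTML)) {
            System.out.println(result);
        }
    }
}
```

Running main prints one line per listing, with the title and snippet separated by a bar. If the real page uses different markup than this sample, the selectors have to be adjusted to match.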

Also, to do this yourself I would suggest using Chrome. Just right-click on whatever you want to scrape and choose Inspect Element. It will take you to the exact spot in the HTML where that element is located. In this case you first want to find out where the root of all the result listings is. When you find it, you want to specify the element, preferably along with a unique attribute to search for it by. In this case the root element is

<ol eid="" id="rso">

Below that, you will see a bunch of result listings that start with

<li class="g"> 

This is what you want to put into your initial elements array. Then, for each element, you will want to find the spot where the title and body are. In this case, I found the title to be under the

<h3 class="r" style="white-space: normal;">

element. So you will search for that element in each listing. The same goes for the body. I found the body to be under the span element with class "st" (the span[class=st] selector in the code above), so I selected it and called the .text() method, which returned all the text under that element. The key is to ALWAYS try to find the element by a distinctive attribute (using a class name is ideal). If you don't, and only search for something like "div", it will match EVERY div element on the page and return them all, so you will get WAY more results than you want. I hope this explains it well. Let me know if you have any more questions.
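
As a rough illustration of the point about broad versus specific selectors, the sketch below counts how many elements each query matches on a small page. The class name SelectorSpecificity, the helper countMatches, and the sample HTML are all hypothetical and exist only for this example; the sample is not Google's actual markup. It assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorSpecificity {

    // Hypothetical page: one result listing surrounded by unrelated div wrappers.
    static final String SAMPLE_HTML =
          "<div id=\"header\"><div class=\"logo\">Logo</div></div>"
        + "<div id=\"main\"><ol id=\"rso\">"
        + "<li class=\"g\"><h3 class=\"r\">Only title</h3>"
        + "<div class=\"s\"><span class=\"st\">Only snippet</span></div></li>"
        + "</ol></div>"
        + "<div id=\"footer\">Footer</div>";

    // Parses the HTML and returns how many elements the CSS query matches.
    static int countMatches(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // A bare tag name matches every element of that type on the page.
        System.out.println("div: " + countMatches(SAMPLE_HTML, "div"));

        // Scoping by the root id and class attributes matches only the title.
        System.out.println("title: "
                + countMatches(SAMPLE_HTML, "#rso li[class=g] h3[class=r]"));
    }
}
```

On this sample, the bare "div" query matches all five div elements, while the scoped query matches only the single title element.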
