How should I modify this code to parse the title, preview, and URL of Google News search results?

Question

I want to parse Google News search results and extract: 1) the article title, 2) the preview, 3) the URL.

To do this, I need to change the part of the code that matches the page structure.

Elements links = Jsoup.connect(google + URLEncoder.encode(search , charset) + news).userAgent(userAgent).get().select( ".g>.r>.a");

Mainly here:

( ".g>.r>.a")

How should I modify it?

Full code:

  public static void main(String[] args) throws UnsupportedEncodingException, IOException {

    String google = "http://www.google.com/search?q=";

    String search = "stackoverflow";

    String charset = "UTF-8";

    String news="&tbm=nws";


    String userAgent = "ExampleBot 1.0 (+http://example.com/bot)"; // Change this to your company's name and bot homepage!

    Elements links = Jsoup.connect(google + URLEncoder.encode(search , charset) + news).userAgent(userAgent).get().select( ".g>.r>.a");

    for (Element link : links) {
        String title = link.text();
        String url = link.absUrl("href"); // Google returns URLs in format "http://www.google.com/url?q=<url>&sa=U&ei=<someKey>".
        url = URLDecoder.decode(url.substring(url.indexOf('=') + 1, url.indexOf('&')), "UTF-8");

        if (!url.startsWith("http")) {
            continue; // Ads/news/etc.
        }
        System.out.println("Title: " + title);
        System.out.println("URL: " + url);
    }
}
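
The URL-unwrapping step in the loop above can be tried on its own. The following is a minimal sketch using only the standard library, assuming the redirect format described in the code comment (`http://www.google.com/url?q=<url>&sa=U&ei=<someKey>`). The class name `RedirectUrlDemo`, the helper `extractTarget`, and the sample URL are hypothetical, made up for illustration. Unlike the `indexOf('=')`/`indexOf('&')` approach, it looks for the `q` parameter explicitly and returns null when it is missing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class RedirectUrlDemo {

    // Hypothetical helper: returns the decoded value of the "q" parameter,
    // or null if the URL has no query string or no "q" parameter.
    static String extractTarget(String redirectUrl) {
        int queryStart = redirectUrl.indexOf('?');
        if (queryStart < 0) {
            return null;
        }
        for (String pair : redirectUrl.substring(queryStart + 1).split("&")) {
            if (pair.startsWith("q=")) {
                try {
                    return URLDecoder.decode(pair.substring(2), "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    // UTF-8 is always supported, so this should not happen
                    throw new IllegalStateException(e);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up sample in the redirect format assumed above
        String sample = "http://www.google.com/url?q=https%3A%2F%2Fexample.com%2Fnews%3Fid%3D1&sa=U&ei=abc";
        System.out.println(extractTarget(sample)); // prints https://example.com/news?id=1
    }
}
```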

Recommended answer

How to select the right elements (using Chrome)

First step: disable JavaScript in your browser (for example, using an add-on like uMatrix for convenience), so that you see the same result as jsoup.

Now right-click an element and choose Inspect, or open the dev tools with Ctrl+Shift+I. When you hover over the source code in the Elements tab, the related element is highlighted in the rendered page. Right-clicking an element in the source offers Copy -> Copy selector. That is a good starting point, but the result is sometimes too strict. Here it gives the selector #rso > div:nth-child(3), meaning the third direct child div of the element with id rso. That is too specific, so we generalize it:

We select all direct child divs of the element with id rso: #rso > div.

Then, within each of them, we grab the headline anchor h3 > a. Its text node gives the title and its href attribute gives the URL.

Next we grab the inner div with class st (div.st), which contains the preview in its text node. If that div is missing, we skip the element.

By passing the parameters with .data("key","value") in the request, we don't need to encode them manually.
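
For comparison, here is a minimal sketch of what encoding the query parameters by hand looks like, using only the standard library. The class name `QueryStringDemo`, the helpers `buildQuery` and `encode`, and the sample search term are hypothetical; the parameter names `q`, `tbm` and `start` are the ones used in this answer. It is meant to illustrate the work that .data(...) saves, not how jsoup builds the request internally.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class QueryStringDemo {

    // Hypothetical helper: form-encodes each key and value and joins the pairs with '&'.
    static String buildQuery(Map<String, String> params) {
        StringJoiner joiner = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            joiner.add(encode(entry.getKey()) + "=" + encode(entry.getValue()));
        }
        return joiner.toString();
    }

    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always supported, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "stack overflow & jsoup"); // spaces and '&' must be escaped
        params.put("tbm", "nws");
        params.put("start", "0");
        // prints https://www.google.com/search?q=stack+overflow+%26+jsoup&tbm=nws&start=0
        System.out.println("https://www.google.com/search?" + buildQuery(params));
    }
}
```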

Example code

String userAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36";
String searchTerm = "stackoverflow";
int numberOfResultpages = 2; // grabs first two pages of search results
String searchUrl = "https://www.google.com/search?";

Document doc;

for (int i = 0; i < numberOfResultpages; i++) {

    try {
        doc = Jsoup.connect(searchUrl)
                .userAgent(userAgent)
                .data("q", searchTerm)
                .data("tbm", "nws")
                .data("start", String.valueOf(i * 10)) // "start" is a result offset, not a page index (assuming 10 results per page)
                .method(Method.GET)
                .referrer("https://www.google.com/").get();

        for (Element result : doc.select("#rso > div")) {

            if(result.select("div.st").size()==0) continue;

            Element h3a = result.select("h3 > a").first();
            if (h3a == null) continue; // skip results without a headline link

            String title = h3a.text();
            String url = h3a.attr("href");
            String preview = result.select("div.st").first().text();

            // just printing out title, link and preview to demonstrate the approach
            System.out.println(title + " -> " + url + "\n\t" + preview);
        }

    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

Output

Stack Overflow: Movie Magic -> https://geekdad.com/2016/09/stack-overflow-movie-magic-2/
    I got to visit the set of Kubo and the Two Strings and see some of the amazing work that went into creating the film. But well before the ...
Will StackOverflow Documentation Realize Its Lofty Goal? -> https://dzone.com/articles/will-stackoverflow-documentation-realize-its-lofty
    With the StackOverflow Documentation project now in beta, how close is it to realizing the lofty goals it has set forth for itself? Can it ever ...
Stack Overflow: Progress Report -> https://geekdad.com/2016/09/stack-overflow-progress-report/
    Of the books on my list, the only one I totally finished so far is Kidding Ourselves, which I included in this Stack Overflow. And that perhaps is an ...
....
