Best way to parse Google Custom Search Engine results


Question

I need to parse the results of a Google Custom Search Engine. My first problem is that everything is rendered by JavaScript. The page below loads the results to be parsed, which open in a JS popup.

<script>
function gcseCallback() {
  if (document.readyState != 'complete')
    return google.setOnLoadCallback(gcseCallback, true);
  google.search.cse.element.render({gname:'gsearch', div:'results', tag:'searchresults-only', attributes:{linkTarget:''}});
  var element = google.search.cse.element.getElement('gsearch');
  element.execute('lectures');
};
window.__gcse = {
  parsetags: 'explicit',
  callback: gcseCallback
};
(function() {
  var cx = 'xxxxxx:xxxxxxx';
  var gcse = document.createElement('script');
  gcse.type = 'text/javascript';
  gcse.async = true;
  gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
    '//www.google.com/cse/cse.js?cx=' + cx;
  var s = document.getElementsByTagName('script')[0];
  s.parentNode.insertBefore(gcse, s);

})();
</script>
<div id="results"></div>

What I have already tried, without success: Selenium, Jsoup, and HtmlUnit.

None of them ever loads the results. I know that adding waits normally gives the JavaScript time to load, but that does not help with Google Custom Search Engine: the data in the div with id=results never loads with any of the tools above. Resources such as the CSS and JS page calls load, but not the actual results. I need to do this in Java. Is there a better way?

Is it possible to force the page to load as plain HTML, without any JavaScript? If the results were in HTML, this would of course be much easier. Or is there a way to convert the page to HTML after the JavaScript has run?

Selenium example

package raTesting;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class Testing {

    public static void main(String[] args)
    {
        // Headless HtmlUnit driver emulating Chrome
        WebDriver driver = new HtmlUnitDriver(BrowserVersion.CHROME);

        driver.get("https://www.google.com/cse/publicurl?q=breaking&cx=005766509181136893168:j_finnh-2pi");

        // Print the page source; the search results are missing from it
        System.out.println(driver.getPageSource());
    }
}

When the URL loads in a browser, it displays all the results that need to be scanned, but the page source never comes back with any results.

Recommended answer

For anyone still looking: alter the code below to fit your needs. Put your procedure into one or more methods and call them from the check() function. Everything inside check() is repeated until the links array has been fully processed or the limit is reached.
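The looping behaviour of check() can be illustrated without CasperJS. The following is only a minimal sketch in plain JavaScript: the names crawl, fetchLinks, fakeWeb and fakeFetch are hypothetical and not part of CasperJS or of the answer's code, and fetchLinks stands in for "open the page and collect its links".

```javascript
// Hypothetical helper, not a CasperJS API: walks a growing array of links
// the same way check() does, stopping at the end of the array or at upTo.
function crawl(startUrl, upTo, fetchLinks) {
    var links = [startUrl];   // work queue; grows as links are discovered
    var visited = [];         // links processed so far, in order
    var currentLink = 0;

    while (links[currentLink] && currentLink < upTo) {
        var link = links[currentLink];
        visited.push(link);
        // Append whatever the page yields, mirroring addLinks()
        links = links.concat(fetchLinks(link));
        currentLink++;
    }
    return visited;
}

// Usage with a fake in-memory "web" instead of real pages (assumed data)
var fakeWeb = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://a.example/"]
};
function fakeFetch(url) {
    return fakeWeb[url] || [];
}
var order = crawl("http://a.example/", 4, fakeFetch);
```

Because found links are appended without any check, the same URL can be queued and visited more than once, which is why the upTo limit matters.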

Known issue: CasperJS runs faster than the Google JS, which results in duplicate links. I haven't been able to tell CasperJS to wait for the Google JS popup to close first.
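One possible workaround, sketched here under assumptions, is to filter out duplicates after the links have been collected. This does not fix the timing problem itself. The helper name uniqueLinks is hypothetical and not part of CasperJS; it could be applied to the array returned by the link search before it is added to the queue.

```javascript
// Hypothetical helper: returns the links with duplicates removed,
// keeping the first occurrence of each and preserving order.
function uniqueLinks(links) {
    var seen = Object.create(null); // plain lookup table keyed by URL
    return links.filter(function (href) {
        if (seen[href]) {
            return false; // already emitted once, drop it
        }
        seen[href] = true;
        return true;
    });
}

// Usage with made-up URLs
var deduped = uniqueLinks([
    "http://a.example/",
    "http://b.example/",
    "http://a.example/"
]);
```

Array.prototype.filter is used rather than newer constructs such as Set, on the assumption that the script may run in an older JavaScript engine.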

var casper = require("casper").create({
    verbose: true
});
// First command-line argument: the start URL
var url = casper.cli.get(0);
// The base links array
var links = [
    url
];

// If we don't set a limit, it could go on forever.
// Second command-line argument: maximum number of links (defaults to 10)
var upTo = ~~casper.cli.get(1) || 10;

var currentLink = 0;

// Get the links, and add them to the links array
// (It could be done all in one step, but it is intentionally split)
function addLinks(link) {
    this.then(function() {
        var found = this.evaluate(searchLinks);
        this.echo(found.length + " links found on " + link);
        links = links.concat(found);
    });
}

// Fetch all <a> elements from the page and return
// the href of each one whose href starts with 'http://'
function searchLinks() {
    var filter, map;
    filter = Array.prototype.filter;
    map = Array.prototype.map;
    return map.call(filter.call(document.querySelectorAll("a"), function(a) {
        return (/^http:\/\/.*/i).test(a.getAttribute("href"));
    }), function(a) {
        return a.getAttribute("href");
    });
}

// Just opens the page and prints the title
function start(link) {
    this.start(link, function() {
        this.echo('Page title: ' + this.getTitle());
    });
}

// As long as it has a next link, and is under the maximum limit, will keep running
function check() {
    if (links[currentLink] && currentLink < upTo) {
        this.echo('--- Link ' + currentLink + ' ---');
        start.call(this, links[currentLink]);
        addLinks.call(this, links[currentLink]);
        currentLink++;
        this.run(check);
    } else {
        this.echo("All done.");
        this.exit();
    }
}

casper.start().then(function() {
    this.echo("Starting");
});

casper.run(check);

Source: http://code.ohloh.net/file?fid=VzTcq4GkQhozuKWkprFfBghgXy4&cid=ZDmcCGgIq6k&s=&fp=513476&mp&projSelected=true#L0
