How many results does Google allow a request to scrape?


Problem description


The following PHP code works fine, but when it is used to scrape 1000 Google results for a specified keyword, it only returns 100 results. Does Google have a limit on results returned, or is there a different problem?

<?php
// getContent() and get_string_between() are helper functions
// provided by header.php.
require_once("header.php");

$data2 = getContent("http://www.google.de/search?q=auch&hl=de&num=100&gl=de&ix=nh&sourceid=chrome&ie=UTF-8");

$dom = new DOMDocument();
@$dom->loadHtml($data2);
$xpath = new DOMXPath($dom);

// Each organic result link sits under div#ires as li > h3 > a.
$hrefs = $xpath->evaluate("//div[@id='ires']//li/h3/a/@href");
$j = 0;

foreach ($hrefs as $href) {
    $url = "http://www.google.de/" . $href->value;
    // Google wraps the target in a redirect URL; extract the real one.
    $url = get_string_between($url, "http://www.google.de//url?q=", "&sa=");
    echo "<b>$j</b> $url<br/>";
    $j++;
}
?>
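The direct cause of the 100-result cap is that Google has historically served at most 100 results per request, whatever value num is set to. Reaching deeper into the result set means paging with the start parameter. The sketch below assumes the getContent() helper from the question's header.php behaves as shown above; the fetch is wrapped in a function so nothing runs at load time:

```php
<?php
// Offsets for paging: Google caps num at 100 per request, so deeper
// results are requested with &start=100, &start=200, and so on.
function google_page_offsets(int $total, int $perPage = 100): array
{
    $offsets = [];
    for ($start = 0; $start < $total; $start += $perPage) {
        $offsets[] = $start;
    }
    return $offsets;
}

// Fetch up to $total result URLs by paging. Relies on getContent()
// from the question's header.php, hence the require inside the function.
function scrape_google_pages(string $keyword, int $total = 1000): array
{
    require_once("header.php");
    $allHrefs = [];
    foreach (google_page_offsets($total) as $start) {
        $url = "http://www.google.de/search?q=" . urlencode($keyword)
             . "&hl=de&num=100&gl=de&start=" . $start;
        $dom = new DOMDocument();
        @$dom->loadHtml(getContent($url));
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate("//div[@id='ires']//li/h3/a/@href");
        if ($hrefs->length === 0) {
            break; // no further pages; Google rarely serves the full 1000
        }
        foreach ($hrefs as $href) {
            $allHrefs[] = $href->value;
        }
        sleep(rand(5, 15)); // pause between pages to avoid detection
    }
    return $allHrefs;
}
```

Note that even with paging, Google often stops well short of 1000 results for a query, which the empty-page check above accounts for.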

Recommended answer


You already accepted an answer; anyway, in case you are still working on your project:


As people noted, Google does not like to be scraped. Their terms of service forbid it, so if you agreed to those terms you are breaking them by accessing the site automatically. That said, Google itself did not ask for permission before crawling other websites, and even Bing was caught copying Google's results; I would guess most other search engines borrow from Google too.


If you must scrape Google, keep your request rate below their detection threshold. Don't hammer them: that will only get your project blocked, and it makes Google pay closer attention to automated access, which makes scraping harder for everyone.


From my experience you can access Google at a rate of 15 to 20 requests per hour (from one IP) long-term without getting blocked. Of course your code needs to simulate a browser and behave properly. Higher rates will get you blocked, first (usually) by a temporary captcha; solving the captcha creates a cookie that allows you to continue. I have also seen long-term captchas and permanent blocks of single IPs and of large subnets. So rule #1: do not get detected, and if you do get detected, stop your scraper automatically.
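The "stop automatically when detected" rule can be sketched as below. The markers checked for ("/sorry/", "unusual traffic", "captcha") are assumptions based on what Google's block page has typically contained, so verify them against a real blocked response before relying on them:

```php
<?php
// Heuristic check for Google's captcha/block interstitial. The markers
// below are assumptions about what that page typically contains;
// confirm them against an actual blocked response.
function looks_blocked(string $html): bool
{
    foreach (["/sorry/", "unusual traffic", "captcha"] as $marker) {
        if (stripos($html, $marker) !== false) {
            return true;
        }
    }
    return false;
}

// Roughly 15-20 requests per hour means 3-4 minutes between requests.
function polite_delay(): int
{
    return rand(180, 240); // seconds
}
```

In the fetch loop this would be used as: `if (looks_blocked($html)) { exit("Detected by Google, stopping.\n"); } sleep(polite_delay());` so the scraper shuts itself down instead of escalating into a permanent block.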


So it is a bit tricky, but if you rely on getting the data out that way, take a look at the open source PHP project at http://scraping.compunect.com/ It can scrape multiple keywords across multiple pages and manages IP addresses so they do not get blocked. I am using that code in projects and so far it works.


If you just need to gather a small amount of data from Google and the real ranking is not important, take a look at their API. If ranking matters or if you need a lot of data you'll need a Google scraper like the one I linked.
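For the API route, Google's Custom Search JSON API is the usual entry point; the endpoint below is real, but the key and cx values are placeholders you would obtain from Google's developer console, and the API returns at most 10 results per call, which is why it only suits small volumes. A sketch:

```php
<?php
// Build a request URL for Google's Custom Search JSON API.
// $key and $cx are placeholders for credentials from Google's console.
function customsearch_url(string $query, string $key, string $cx, int $start = 1): string
{
    return "https://www.googleapis.com/customsearch/v1?" . http_build_query([
        "key"   => $key,
        "cx"    => $cx,
        "q"     => $query,
        "num"   => 10,     // the API returns at most 10 results per call
        "start" => $start, // 1-based index of the first result
    ]);
}

// Usage (performs a real HTTP request, needs valid credentials):
// $json = json_decode(file_get_contents(
//     customsearch_url("auch", "YOUR_API_KEY", "YOUR_CX")), true);
// foreach ($json["items"] ?? [] as $item) { echo $item["link"], "\n"; }
```

Note the trade-off the answer describes: the API gives you clean JSON and no blocking worries, but its result set and ranking do not match the organic result pages.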


Btw, PHP is quite well suited for the task but you should run it as a local script and not through Apache.
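Running it from the command line avoids Apache's request timeouts and keeps the scraper off a public URL. A minimal guard at the top of the script, sketched:

```php
<?php
// Refuse to run under a web server; this script is meant for the CLI
// (php scraper.php), where there is no web-server request timeout.
if (PHP_SAPI !== "cli") {
    exit("Run this script from the command line, not through Apache.\n");
}

set_time_limit(0); // remove the execution time limit for long scrape runs
ini_set("memory_limit", "256M"); // headroom for large result sets; adjust as needed
```

A long-running local script can then pace itself freely with sleep() between requests, which is exactly what Apache's per-request model makes awkward.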

