Avoiding Google Scholar block for crawling


Question

I have used the following Python script to crawl Google Scholar:

import urllib.request

# Fetch the page and print each line of the response
filehandle = urllib.request.urlopen('http://www.techyupdates.blogspot.com')

for line in filehandle.readlines():
    print(line)

filehandle.close()

But I am doing it repeatedly, so I am getting blocked by Google Scholar, which says:

This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service. The block will expire shortly after those requests stop. In the meantime, solving ....

Is there an easy way to avoid this? Any suggestions?

Answer

[edit]

Put some kind of throttling into your script so you lightly load Google Scholar (wait for 60s or 600s or 6000s between queries, for example).
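A minimal throttling sketch along those lines; the `throttled` helper, its injectable `sleep` parameter, and the 60-second default are illustrative assumptions, not part of the original script:

```python
import time

def throttled(fetch, urls, delay_seconds=60, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between requests.

    The sleep callable is injectable so the pacing logic can be
    exercised without actually waiting.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait between queries, not before the first
        results.append(fetch(url))
    return results
```

Raising `delay_seconds` to 600 drops the same loop to six queries an hour without any other change.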

And I do mean lightly load Google Scholar. If caching the Google Scholar results is possible, that would also reduce the Google Scholar load.
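A rough sketch of on-disk caching; the `cached` helper, the hash-based filenames, and the `scholar_cache` directory name are assumptions for illustration:

```python
import hashlib
import os

def cached(fetch, url, cache_dir="scholar_cache"):
    """Return fetch(url), storing the body on disk so a repeated
    query is served locally instead of hitting Google Scholar again."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    body = fetch(url)  # only reached on a cache miss
    with open(path, "wb") as f:
        f.write(body)
    return body
```

Hashing the URL keeps the cache filename valid regardless of the characters in the query string.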

You should also look at batch processing, so you can run your crawl overnight at a steady but slow speed.
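One way to keep an overnight batch at a steady slow speed is to spread the whole query list evenly across the available window; this small helper (an assumption, not something from the answer) computes the delay to use between queries:

```python
def batch_delay(num_queries, window_seconds=8 * 3600):
    """Seconds to wait between queries so num_queries finish
    evenly spaced within the given time window."""
    if num_queries <= 1:
        return 0.0
    return window_seconds / (num_queries - 1)
```

For 200 queries over an 8-hour night this works out to roughly 145 seconds between requests.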

The goal is that Google Scholar should not care about your additional queries, thereby fulfilling the spirit of the ToS if not the letter. But if you can fulfill both, that would be the Right Thing to Do.

