Web scraping Google search results

Question

I am web scraping Google Scholar search results page by page. After a certain number of pages, a captcha pops up and interrupts my code. I read that Google limits the requests that I can make per hour. Is there any way around this limit? I read something about APIs, but I'm not sure if that is helpful.

Answer

I feel your pain, since I have done scraping from Google in the past. I tried the following things to get the job done; the list is ordered from the easiest to the hardest technique.

  • Throttle your requests per second: Google and many other websites will identify a large number of requests per second coming from the same machine and block them automatically as a defensive action against Denial-of-Service attacks. All you need to do is be gentle and make just 1 request every 1-5 seconds, for instance, to avoid being banned quickly.
  • Randomize your sleep time: Making your code sleep for exactly 1 second is too easy to detect as being a script. Make it sleep for a random amount of time at every iteration. This StackOverflow answer shows an example of how to randomize it; the first sketch after this list combines throttling with a randomized sleep.
  • Use a web scraper library with cookies enabled: If you write scraping code from scratch, Google will notice your requests don't return the cookies it sent. Use a good library, such as Scrapy, to circumvent this issue (see the second sketch below).
  • Use multiple IP addresses: Throttling will definitely reduce your scraping throughput. If you really need to scrape your data fast, you will need to use several IP addresses in order to avoid being banned. There are several companies providing this kind of service on the Internet for a certain amount of money. I have used ProxyMesh and really liked their quality, documentation, and customer support (see the third sketch below).
  • Use a real browser: Some websites will recognize your scraper if it doesn't process JavaScript or have a graphical interface. Using a real browser with Selenium, for instance, will solve this problem (see the last sketch below).
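
For the first two points, here is a minimal sketch in Python (assuming the requests library; the Scholar URL list is only an illustration, not a tested query):

    import random
    import time

    import requests

    # Illustrative list of result pages; each Scholar page holds 10 results.
    urls = ["https://scholar.google.com/scholar?q=web+scraping&start=%d" % (i * 10)
            for i in range(5)]

    for url in urls:
        response = requests.get(url)
        print(response.status_code, url)
        # Sleep a random 1-5 seconds so the request pattern does not look scripted.
        time.sleep(random.uniform(1, 5))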
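
For the cookie point, a minimal Scrapy spider sketch follows; Scrapy keeps cookies enabled by default and also ships the throttling settings shown. The CSS selector is a hypothetical placeholder, not the real Scholar markup:

    import scrapy

    class ScholarSpider(scrapy.Spider):
        name = "scholar"
        start_urls = ["https://scholar.google.com/scholar?q=web+scraping"]

        custom_settings = {
            "COOKIES_ENABLED": True,           # Scrapy's default, shown for emphasis
            "DOWNLOAD_DELAY": 2,               # seconds between requests
            "RANDOMIZE_DOWNLOAD_DELAY": True,  # vary the delay between 0.5x and 1.5x
        }

        def parse(self, response):
            # Hypothetical selector; adjust to the page's actual markup.
            for title in response.css("h3 a::text").getall():
                yield {"title": title}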
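
For the multiple-IP point, one common pattern is rotating requests through a proxy pool. A sketch with requests, where the proxy endpoints are made-up placeholders standing in for a paid service such as ProxyMesh:

    import random

    import requests

    # Made-up proxy endpoints standing in for a rotating-proxy service.
    PROXY_POOL = [
        "http://user:password@proxy1.example.com:31280",
        "http://user:password@proxy2.example.com:31280",
    ]

    def fetch(url):
        proxy = random.choice(PROXY_POOL)  # pick a different exit IP per request
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)

    response = fetch("https://scholar.google.com/scholar?q=web+scraping")
    print(response.status_code)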
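
And for the real-browser point, a minimal Selenium sketch (Selenium 4 syntax; the CSS selector is again a hypothetical placeholder):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # A real Chrome instance executes JavaScript and looks like normal traffic.
    driver = webdriver.Chrome()
    try:
        driver.get("https://scholar.google.com/scholar?q=web+scraping")
        # Hypothetical selector; adjust to the page's actual markup.
        for link in driver.find_elements(By.CSS_SELECTOR, "h3 a"):
            print(link.text)
    finally:
        driver.quit()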

You can also take a look at my crawler project, written for the Web Search Engines course at New York University. It does not scrape Google per se, but it contains some of the aforementioned techniques, such as throttling and randomizing the sleep time.
