Using Tor + Privoxy to scrape Google Shopping results: how to avoid being blocked?

Question

I have installed Tor + Privoxy on my server and they are working fine (tested). But when I use urllib2 (Python) to scrape Google Shopping results through the proxy, Google always blocks me (sometimes with a 503 error, sometimes with a 403). Does anyone have a solution that would help me avoid this? It would be much appreciated!

The source code that I am using:

    import urllib2

    _HEADERS = {
        'User-Agent': 'Mozilla/5.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'deflate',
        'Connection': 'close',
        'DNT': '1'
    }

    request = urllib2.Request("https://www.google.com/#q=iphone+5&tbm=shop", headers=_HEADERS)

    # Route requests through Privoxy (which forwards to Tor). Both schemes are
    # listed because ProxyHandler only proxies the schemes named in this dict.
    proxy_support = urllib2.ProxyHandler({"http": "127.0.0.1:8118", "https": "127.0.0.1:8118"})
    opener = urllib2.build_opener(proxy_support)
    urllib2.install_opener(opener)

    try:
        response = urllib2.urlopen(request)
        html = response.read()
        print html
    except urllib2.HTTPError as e:
        # A blocked request shows up here as 503 or 403
        print e.code
        print e.reason


Note: when I don't use the proxy, it works fine!

Recommended answer

Google blocks many Tor exit nodes because it receives a large volume of requests from them. So whether you hit this error is a matter of probability: keep changing your Tor exit node until you find one that Google has not blocked.

https://www.torproject.org/docs/faq.html.en#GoogleCAPTCHA
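
One way to "change your exit node" from code is to ask Tor for a new circuit whenever Google answers with 403 or 503, then retry. The following is only a rough sketch, not a tested recipe. It is written for Python 3 (`urllib.request` is the successor of `urllib2`), and it assumes that Tor's control port is enabled on 9051 and accepts the placeholder password shown; neither is stated in the question. The helper names (`request_new_circuit`, `fetch_via_proxy`, `fetch_with_rotation`) are made up for illustration. A new circuit usually, but not necessarily, leaves through a different exit node, and a fresh exit may be blocked as well.

```python
import socket
import time
import urllib.error
import urllib.request

PRIVOXY = "127.0.0.1:8118"   # Privoxy address from the question
CONTROL_PORT = 9051          # assumption: Tor ControlPort enabled on this port
CONTROL_PASSWORD = ""        # assumption: placeholder, use your own setting
BLOCKED_CODES = (403, 503)   # the status codes reported in the question


def request_new_circuit(host="127.0.0.1", port=CONTROL_PORT,
                        password=CONTROL_PASSWORD):
    """Send SIGNAL NEWNYM on Tor's control port to request a fresh circuit."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(('AUTHENTICATE "%s"\r\n' % password).encode("ascii"))
        if not sock.recv(1024).startswith(b"250"):
            raise RuntimeError("Tor control authentication failed")
        sock.sendall(b"SIGNAL NEWNYM\r\n")
        if not sock.recv(1024).startswith(b"250"):
            raise RuntimeError("Tor did not accept SIGNAL NEWNYM")


def fetch_via_proxy(url, headers):
    """Fetch url through Privoxy, proxying both http and https."""
    proxy = urllib.request.ProxyHandler({"http": PRIVOXY, "https": PRIVOXY})
    opener = urllib.request.build_opener(proxy)
    request = urllib.request.Request(url, headers=headers)
    return opener.open(request, timeout=30).read()


def fetch_with_rotation(url, headers, fetch=fetch_via_proxy,
                        rotate=request_new_circuit,
                        max_attempts=5, wait_seconds=10):
    """Retry on 403/503, requesting a new Tor circuit between attempts.

    Returns the response body, or None if every attempt was blocked.
    Other HTTP errors are re-raised unchanged.
    """
    for _ in range(max_attempts):
        try:
            return fetch(url, headers)
        except urllib.error.HTTPError as e:
            if e.code not in BLOCKED_CODES:
                raise
            rotate()
            # Pause so the new circuit can be built before retrying.
            time.sleep(wait_seconds)
    return None
```

With those assumptions in place, `fetch_with_rotation(url, headers)` would stand in for the single `urlopen` call in the question. The `fetch` and `rotate` parameters exist only so the retry loop can be exercised without a running Tor instance.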
