Using Tor with the Scrapy framework


Question

I am trying to crawl a website that is sophisticated enough to stop bots: it permits only a few requests, and after that Scrapy hangs.

Question 1: If Scrapy hangs, is there a way to restart the crawl from the same point? To work around the blocking, I wrote my settings file like this:

BOT_NAME = 'MOZILLA'
BOT_VERSION = '7.0'

SPIDER_MODULES = ['yp.spiders']
NEWSPIDER_MODULE = 'yp.spiders'
DEFAULT_ITEM_CLASS = 'yp.items.YpItem'
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

DOWNLOAD_DELAY = 0.25
DUPEFILTER=True
COOKIES_ENABLED=False
RANDOMIZE_DOWNLOAD_DELAY=True
SCHEDULER_ORDER='BFO'

This is my spider:

class ypSpider(CrawlSpider):

   name = "yp"

   start_urls = [
       SOME URL
   ]
   rules = (
      # These are some rules
   )

   def parse_item(self, response):
      # Clean the HTML page by removing scripts and HTML tags
      hxs = HtmlXPathSelector(response)

My question is where I should set the HTTP proxies, and whether I have to import any Tor-related classes. I am new to Scrapy, and I have learned a lot thanks to this group. Now I am trying to learn how to use IP rotation or Tor.

As one of our members suggested, I started Tor and set http_proxy to:

set http_proxy=http://localhost:8118

But it threw an error:

failure with no frames>: class 'twisted.internet.error.ConnectionRefusedError'   Connection was refused by other side 10061: No connection could be made because the target machine actively refused it.

So I changed http_proxy to:

set http_proxy=http://localhost:9051

Now the error is:

failure with no frames>: class 'twisted.internet.error.ConnectionDone' connection was closed cleanly.

I checked the Firefox network settings. I couldn't see any HTTP proxy there; instead it is using SOCKS v5 and shows 127.0.0.1:9051. (Before Tor, it worked with no proxies.) Please help, I still don't understand how to use Tor through Scrapy. Which Tor bundle am I supposed to use, and how? I hope both of my questions can be resolved.

  1. If the crawler hangs for some reason (a failed connection), I want to resume the crawl from that point
  2. How to use rotating IPs in Scrapy

Recommended answer

Tor by itself is not an HTTP proxy. Port 8118 and the connection-refused error suggest that you don't have Privoxy [1] running properly. Try setting up Privoxy correctly, then try again with the environment variable http_proxy=http://localhost:8118.
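A quick way to tell whether anything is listening on the proxy port before starting the crawl is to attempt a plain TCP connection. The sketch below is only an illustration: the helper name port_is_open is made up for this example, and the host and port are the ones from the question (localhost, 8118).

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper for illustration; it only checks that something
    accepts connections, not that the listener is actually Privoxy.
    """
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling port_is_open("localhost", 8118) and getting False would be consistent with the connection-refused error above, meaning nothing is listening on that port yet.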

I have successfully crawled through Tor using Privoxy with Scrapy.
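As an alternative to the environment variable, Scrapy also picks up a per-request proxy from request.meta['proxy'], so the proxy can be set from a small downloader middleware. The following is a minimal sketch, not the answerer's actual setup: the class name TorProxyMiddleware, the module path in the comment, and the priority value are assumptions, and it presumes Privoxy is listening on 127.0.0.1:8118 and forwarding to Tor.

```python
class TorProxyMiddleware:
    """Downloader middleware sketch: route requests through a local HTTP proxy.

    Assumption: Privoxy listens on 127.0.0.1:8118 and forwards to Tor.
    To enable it, something like the following would go in settings.py
    (the module path 'yp.middlewares' is hypothetical):

        DOWNLOADER_MIDDLEWARES = {'yp.middlewares.TorProxyMiddleware': 100}
    """

    def __init__(self, proxy_url="http://127.0.0.1:8118"):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Scrapy uses request.meta['proxy'] as the proxy for this request.
        # setdefault keeps any proxy already chosen for the request.
        request.meta.setdefault("proxy", self.proxy_url)
        # Returning None lets Scrapy continue processing the request.
        return None
```

Keeping the proxy URL as a constructor argument makes it easy to swap in a different local port, or to extend the class later to pick from a list of proxies for IP rotation.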

[1] http://www.privoxy.org/
