How to set different IP according to different commands of one single scrapy.Spider?


Question

I have a bunch of pages to scrape, about 200,000. I usually use Tor and a Polipo proxy to hide my spiders' behaviour; even if they are polite, you never know. If I log in, it is useless to keep a single account and just change IP, so instead I create several accounts on the website and set up my spider with arguments like the following:

class ASpider(scrapy.Spider):
    name = "spider"
    start_urls = ['https://www.a_website.com/compte/login']

    def __init__(self, username=None, password=None, *args, **kwargs):
        # Forward any extra keyword arguments to the base Spider class
        super(ASpider, self).__init__(*args, **kwargs)
        self.username = username
        self.password = password

    def parse(self, response):
        token = response.css('[name="_csrf_token"]::attr(value)').get()
        data_log = {
            '_csrf_token': token,
            '_username': self.username,
            '_password': self.password
        }
        yield scrapy.FormRequest.from_response(response, formdata=data_log, callback=self.after_login)  # No matter the rest

And I run several identical spiders like this:

scrapy crawl spider -a username=Bidule -a password=TMTC #cmd1

scrapy crawl spider -a username=Truc -a password=TMTC #cmd2

so that I crawl with several commands in parallel, since I have several accounts.

I managed to check the IP with the following code at the end of spider.py:

        # inside a parse callback:
        yield scrapy.Request('http://checkip.dyndns.org/', meta={'item': item_cheval}, callback=self.checkip)

    def checkip(self, response):
        print('IP: {}'.format(response.xpath('//body/text()').re(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')[0]))

It returns the same IP for both launched commands, so my proxy does not manage to give a different IP to each spider.

Someone told me about bindaddress, but I have no idea how it works or whether it really gives what I expect.
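From what I have read, bindaddress is a Request.meta key that tells Scrapy which local network interface the outgoing socket binds to, so it only helps if the machine itself already has several outgoing IPs; it does not create new public IPs on its own. A minimal sketch, assuming a hypothetical second local address 10.0.0.5 on the machine and a hypothetical parse_page callback (the (host, port) tuple follows Twisted's convention, with port 0 meaning any free port):

yield scrapy.Request(
    'https://www.a_website.com/some_page',
    # Hypothetical local IP configured on this machine; Scrapy passes it to
    # Twisted as the local address to bind the client socket to.
    meta={'bindaddress': ('10.0.0.5', 0)},
    callback=self.parse_page,
)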

Note: I use this in middlewares.py:

class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # Attach the proxy URL from the project settings to every request
        request.meta['proxy'] = spider.settings.get('HTTP_PROXY')

and this in my settings.py:

# proxy for polipo
HTTP_PROXY = 'http://127.0.0.1:8123'
....
DOWNLOADER_MIDDLEWARES = {
    'folder.middlewares.RandomUserAgentMiddleware': 400,
    'folder.middlewares.ProxyMiddleware': 410,  # Here for proxy
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

These are patterns I copied into my code and they work, but I have not mastered this skill.
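I suppose the same middleware pattern could give each command its own proxy by passing it as a spider argument, like username and password. This is only a sketch of my guess: the -a proxy argument and one Polipo/Tor instance per port are assumptions, not what I currently run (the spider's __init__ has to forward extra keyword arguments to the base class, as above, for the argument to become an attribute):

class ProxyMiddleware(object):

    def process_request(self, request, spider):
        # Prefer a per-spider proxy passed on the command line with
        # `-a proxy=http://127.0.0.1:8124` (one Polipo/Tor instance per port);
        # otherwise fall back to the project-wide HTTP_PROXY setting.
        request.meta['proxy'] = getattr(spider, 'proxy', None) or spider.settings.get('HTTP_PROXY')

Each command would then point at a different local proxy port, for example:

scrapy crawl spider -a username=Bidule -a password=TMTC -a proxy=http://127.0.0.1:8123

scrapy crawl spider -a username=Truc -a password=TMTC -a proxy=http://127.0.0.1:8124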

Scrapy version: 1.5.0, Python version: 2.7.9, Tor version: 0.3.4.8, Vidalia: 0.2.21

Answer

If you get a proxy list, then you can use 'scrapy_proxies.RandomProxy' in DOWNLOADER_MIDDLEWARES to choose a random proxy from the list for every new page.

In the spider settings:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

PROXY_LIST = 'path/proxylist.txt'
PROXY_MODE = 0
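The proxy list is a plain text file with one proxy URL per line; the hosts below are placeholders, not real proxies, and per the scrapy_proxies documentation PROXY_MODE = 0 means a different random proxy is picked for each request:

http://host1:8080
http://username:password@host2:8080
http://host3:3128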

With this method there is nothing to add to the spider script.
