Python - Unable to rotate userAgent dynamically in Scrapy


Question

I am overriding the default implementations of the Scrapy modules HttpProxyMiddleware and UserAgentMiddleware. My own implementation rotates the user agent and the IP address, picking each value at random from a provided list. The IP changes for every request, but the user agent does not. I am unable to figure out why.

Here is my implementation of the classes:

RotateUserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)
            # Add desired logging message here.
            spider.log(
                u'User-Agent: {} {}'.format(request.headers.get('User-Agent'), request)
            )

ProxyMiddleware

class ProxyMiddleware(HttpProxyMiddleware):
    def __init__(self, proxy_ip=''):
        self.proxy_ip = proxy_ip

    def process_request(self, request, spider):
        ip = random.choice(self.proxy_list)
        if ip:
            request.meta['proxy'] = ip
            print(request.meta)
        return request

The changes I made to DOWNLOADER_MIDDLEWARES in settings.py are:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'IpRotation.ProxyMiddleware.ProxyMiddleware': 800,
    'scrapy.downloadermiddleware.useragent.UserAgentMiddleware' : None,
    'IpRotation.RotateUserAgentMiddleware.RotateUserAgentMiddleware':790
}

Printing the IP and user-agent values on my console for each request:

2015-10-09 15:51:46 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '198.*.*.*:80'}
2015-10-09 15:51:46 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '195.*.*.*:3120'}
2015-10-09 15:51:46 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '94.*.*.*:3128'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '94.*.*.*:3128'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '198.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '94.*.*.*:3128'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '94.*.*.*:3128'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '195.*.*.*:3120'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '200.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '198.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '213.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '198.*.*.*:80'}
2015-10-09 15:51:47 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '200.*.*.*:80'}
2015-10-09 15:51:48 [dmoz] DEBUG: User-Agent: Scrapy/1.0.3 (+http://scrapy.org) <GET http://www.imdb.com/chart/top>
{'download_timeout': 180.0, 'proxy': '58.*.*.*:80'}

I did not change USER_AGENT in settings.py, since I have to assign the value randomly:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'IPProxy (+http://www.yourdomain.com)'

The part of the project I do not understand is the values assigned in DOWNLOADER_MIDDLEWARES. None tells Scrapy to ignore the class, but what do the integers mean? Could someone please help me out?

Recommended answer

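Change the value of 'IpRotation.RotateUserAgentMiddleware.RotateUserAgentMiddleware' in DOWNLOADER_MIDDLEWARES to less than 400. The integers set the order of the middlewares: those with lower numbers have process_request called first. Because your middleware uses headers.setdefault, it only takes effect when no User-Agent header has been set by an earlier middleware.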
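To illustrate why the order number matters, here is a minimal plain-Python sketch that does not import Scrapy. It assumes, as the answer implies, that a built-in-style middleware sits at order 400 and sets the default User-Agent with setdefault. FakeRequest, BuiltinLikeUserAgent, RotatingUserAgent, run_chain and the agent strings are hypothetical names made up for this illustration, not Scrapy APIs.

```python
import random

# Default value seen in the question's log output.
DEFAULT_UA = "Scrapy/1.0.3 (+http://scrapy.org)"
# Hypothetical user-agent strings, for illustration only.
USER_AGENTS = ["agent-a/1.0", "agent-b/1.0"]


class FakeRequest:
    """Tiny stand-in for a request: just a URL and a headers dict."""

    def __init__(self, url):
        self.url = url
        self.headers = {}


class BuiltinLikeUserAgent:
    """Mimics a default user-agent middleware: fills the header only if absent."""

    def process_request(self, request):
        request.headers.setdefault("User-Agent", DEFAULT_UA)


class RotatingUserAgent:
    """Mimics the question's middleware: random choice, also via setdefault."""

    def __init__(self, agents):
        self.agents = list(agents)

    def process_request(self, request):
        request.headers.setdefault("User-Agent", random.choice(self.agents))


def run_chain(chain, request):
    """Call process_request on each middleware, lowest order number first."""
    for _order, middleware in sorted(chain, key=lambda pair: pair[0]):
        middleware.process_request(request)
    return request.headers["User-Agent"]


def user_agent_with_order(rotator_order):
    """Return the final User-Agent when the rotator runs at the given order."""
    chain = [
        (400, BuiltinLikeUserAgent()),  # assumed order of the default middleware
        (rotator_order, RotatingUserAgent(USER_AGENTS)),
    ]
    return run_chain(chain, FakeRequest("http://www.imdb.com/chart/top"))


if __name__ == "__main__":
    print(user_agent_with_order(790))  # rotator runs second, header already set
    print(user_agent_with_order(390))  # rotator runs first, its choice is kept
```

With the rotator at 790, the default middleware has already filled the header, so setdefault leaves it unchanged. At 390 the rotator runs first and its choice is kept. Two related observations, offered as likely rather than confirmed causes: the settings in the question spell the built-in path as 'scrapy.downloadermiddleware.useragent.UserAgentMiddleware', without the trailing "s" used in the HttpProxyMiddleware entry, so that None entry probably does not disable the built-in middleware. And assigning request.headers['User-Agent'] = ua directly instead of calling setdefault would override whatever an earlier middleware had set.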
