Scrapy - Correct way to change User Agent in Request


Question

I have created a custom middleware in Scrapy by overriding RetryMiddleware so that it changes both the proxy and the User-Agent before retrying. It looks like this:

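import logging

from scrapy.downloadermiddlewares.retry import RetryMiddleware

logger = logging.getLogger(__name__)


class CustomRetryMiddleware(RetryMiddleware):
    def _retry(self, request, reason, spider):
        retries = request.meta.get('retry_times', 0) + 1

        if retries <= self.max_retry_times:
            # Proxy_UA_Middleware is my own class (defined elsewhere, not shown);
            # rotate the proxy and the user agent before scheduling the retry.
            Proxy_UA_Middleware.switch_proxy()
            Proxy_UA_Middleware.switch_ua()
            logger.debug("Retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})
            # The copy carries over the original request's headers and meta.
            retryreq = request.copy()
            retryreq.meta['retry_times'] = retries
            retryreq.dont_filter = True
            retryreq.priority = request.priority + self.priority_adjust
            return retryreq
        else:
            # Retry budget exhausted: log it and return None implicitly.
            logger.debug("Gave up retrying %(request)s (failed %(retries)d times): %(reason)s",
                         {'request': request, 'retries': retries, 'reason': reason},
                         extra={'spider': spider})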

The Proxy_UA_Middleware class is quite long. Basically, it contains methods that change the proxy and the user agent. I have both of these middlewares configured properly in my settings.py file. The proxy part works, but the User-Agent doesn't change. The code I've used to change the User-Agent looks like this:

request.headers.setdefault('User-Agent', self.user_agent)

where self.user_agent is a random value taken from a list of user agents. This doesn't work. However, if I do this:

request.headers['User-Agent'] = self.user_agent

then it works just fine and the user agent changes successfully on each retry. But I haven't seen anyone use this method to change the User-Agent. My question is whether changing the User-Agent this way is okay and, if not, what am I doing wrong?

Solution

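If you always want that middleware to control which user agent is used, then assigning the header directly is fine. setdefault only sets User-Agent when no value has been assigned yet, so it does nothing if the header is already present. That can easily be the case, because another middleware, or even the spider, may have set it earlier. A retried request is also a copy of the original request, so it most likely already carries the User-Agent from the first attempt.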
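The difference between the two calls can be seen in a minimal sketch. It assumes a plain dict standing in for request.headers, so that it runs without Scrapy installed, and the user agent strings are placeholders rather than real values:

```python
# A plain dict stands in for request.headers (assumption, to avoid a Scrapy import).
# The header is already present, as it would be on a request that was sent once.
headers = {"User-Agent": "agent-a/1.0"}

# setdefault writes only when the key is missing, so the old value survives.
headers.setdefault("User-Agent", "agent-b/1.0")
after_setdefault = headers["User-Agent"]

# Direct assignment always overwrites the existing value.
headers["User-Agent"] = "agent-b/1.0"
after_assignment = headers["User-Agent"]

print(after_setdefault)  # agent-a/1.0
print(after_assignment)  # agent-b/1.0
```

This matches the behaviour described in the question: setdefault leaves the first user agent in place, while assignment replaces it.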

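I also think you should either disable the default UserAgentMiddleware or make your middleware run before it. UserAgentMiddleware has order 400 here (the number differs between Scrapy versions, so check DOWNLOADER_MIDDLEWARES_BASE for yours), and middlewares with lower numbers process the request first, so give your middleware a number below that.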
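As a rough sketch, the relevant part of settings.py might look like the following. The module path myproject.middlewares and the order value 390 are assumptions for illustration; only the built-in UserAgentMiddleware path comes from Scrapy itself:

```python
# settings.py (sketch). "myproject.middlewares" is a hypothetical module path.
DOWNLOADER_MIDDLEWARES = {
    # Mapping a built-in middleware to None disables it.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Hypothetical location of the custom middleware. 390 is an assumed value
    # chosen to sit below the 400 mentioned above; adjust it for your version.
    "myproject.middlewares.Proxy_UA_Middleware": 390,
}
```

Disabling the built-in middleware and lowering the order number are alternatives; doing both, as here, is harmless but not required.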
