Setting Scrapy proxy middleware to rotate on each request
Question
This question necessarily comes in two forms, because I don't know the better route to a solution.
A site I'm crawling kicks me to a redirected "User Blocked" page often, but the frequency (by requests/time) seems random, and they appear to have a blacklist blocking many of the "open" proxies list I'm using through Proxymesh. So...
When Scrapy receives a "Redirect" to its request (e.g.
DEBUG: Redirecting (302) to (GET http://.../you_got_blocked.aspx) from (GET http://.../page-544.htm)
), does it continue to try to get to page-544.htm, or will it continue on to page-545.htm and forever lose out on page-544.htm? If it "forgets" (or counts it as visited), is there a way to tell it to keep retrying that page? (If it does that naturally, then yay, and good to know...)
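By default, Scrapy's RedirectMiddleware follows the 302 to you_got_blocked.aspx and the original page-544.htm request is not automatically retried. One way to change that, sketched below as a settings.py fragment (the retry values are illustrative, not from the question), is to disable automatic redirect handling so the 302 response reaches RetryMiddleware and gets re-queued:

```python
# settings.py -- stop Scrapy from silently following the block-page redirect.
# Instead, treat a 302 as a retryable error so the original URL is re-queued.
REDIRECT_ENABLED = False                      # don't follow 3xx automatically
RETRY_HTTP_CODES = [302, 500, 502, 503, 504]  # 302 added to the default list
RETRY_TIMES = 5                               # give up after this many retries
```

Note that this disables redirect-following for the whole spider, so it fits best when legitimate pages on the target site don't themselves redirect.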
What is the most efficient solution?
(a) What I'm currently doing: using a proxymesh rotating Proxy through the http_proxy environment variable, which appears to rotate proxies often enough to at least fairly regularly get through the target site's redirections. (Downsides: the open proxies are slow to ping, there are only so many of them, proxymesh will eventually start charging me per gig past 10 gigs, I only need them to rotate when redirected, I don't know how often or on what trigger they rotate, and the above: I don't know if the pages I'm being redirected from are being re-queued by Scrapy...) (If Proxymesh is rotating on each request, then I'm okay with paying reasonable costs.)
(b) Would it make sense (and be simple) to use middleware to reselect a new proxy on each redirection? What about on every single request? Would that make more sense through something else like TOR or Proxifier? If this is relatively straightforward, how would I set it up? I've read something like this in a few places, but most are outdated with broken links or deprecated Scrapy commands.
For reference, I do have middleware currently set up for Proxy Mesh (yes, I'm using the http_proxy environment variable, but I'm a fan of redundancy when it comes to not getting in trouble). So this is what I have for that currently, in case that matters:
import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://open.proxymesh.com:[port number]"
        proxy_user_pass = "username:password"
        # b64encode (unlike the deprecated encodestring) appends no trailing
        # newline, which would otherwise corrupt the header value
        encoded_user_pass = base64.b64encode(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
Answer
Yesterday I had a similar task with proxies and protection against DDoS (I was scraping a site). The idea is in random.choice: every request has a chance of changing its IP. Scrapy is used together with Tor and telnetlib, and you need to configure the Tor ControlPort password.
from scrapy import log
from settings import USER_AGENT_LIST
import random
import telnetlib
import time

# 15% chance of an IP change on each request
class RetryChangeProxyMiddleware(object):
    def process_request(self, request, spider):
        if random.choice(xrange(1, 100)) <= 15:
            log.msg('Changing proxy')
            # ask the local Tor ControlPort for a fresh circuit
            tn = telnetlib.Telnet('127.0.0.1', 9051)
            tn.read_until("Escape character is '^]'.", 2)
            tn.write('AUTHENTICATE "<PASSWORD HERE>"\r\n')
            tn.read_until("250 OK", 2)
            tn.write("signal NEWNYM\r\n")
            tn.read_until("250 OK", 2)
            tn.write("quit\r\n")
            tn.close()
            log.msg('>>>> Proxy changed. Sleep Time')
            time.sleep(10)

# 30% chance of a User-Agent change on each request
class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        if random.choice(xrange(1, 100)) <= 30:
            log.msg('Changing UserAgent')
            ua = random.choice(USER_AGENT_LIST)
            if ua:
                # setdefault only takes effect if no User-Agent is set yet,
                # so order this before the default UserAgentMiddleware
                request.headers.setdefault('User-Agent', ua)
            log.msg('>>>> UserAgent changed')
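To take effect, middlewares like these must be registered in settings.py. The module path 'myproject.middlewares' and the priority numbers below are placeholders to adapt to your project layout:

```python
# settings.py -- register the two middlewares; lower numbers run earlier
# on the way to the downloader
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RetryChangeProxyMiddleware': 100,
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}
```

The UserAgent middleware is given priority 400 here so its setdefault() runs before Scrapy's built-in UserAgentMiddleware (default priority 500) would fill in the header.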