Setting Scrapy proxy middleware to rotate on each request


Question


This question necessarily comes in two forms, because I don't know the better route to a solution.


A site I'm crawling kicks me to a redirected "User Blocked" page often, but the frequency (by requests/time) seems random, and they appear to have a blacklist blocking many of the "open" proxies list I'm using through Proxymesh. So...


  1. When Scrapy receives a "Redirect" to its request (e.g. DEBUG: Redirecting (302) to (GET http://.../you_got_blocked.aspx) from (GET http://.../page-544.htm)), does it continue to try to get to page-544.htm, or will it continue on to page-545.htm and forever lose out on page-544.htm? If it "forgets" (or counts it as visited), is there a way to tell it to keep retrying that page? (If it does that naturally, then yay, and good to know...)
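(For reference, "telling it to keep retrying" would look something like the downloader middleware sketched below. The class name and the `you_got_blocked` URL marker are assumptions; order it with a number *above* 600 in `DOWNLOADER_MIDDLEWARES` so it sees the 302 before the built-in `RedirectMiddleware` follows it.)

```python
import logging


class RetryBlockedRedirectMiddleware(object):
    """Sketch: re-queue the original request instead of following the
    site's block-page redirect."""

    def process_request(self, request, spider):
        return None  # let requests pass through untouched

    def process_response(self, request, response, spider):
        location = response.headers.get('Location', b'')
        if response.status == 302 and b'you_got_blocked' in location:
            logging.info('Blocked at %s -- re-queueing', request.url)
            # dont_filter=True keeps the dupe filter from dropping the retry
            return request.replace(dont_filter=True)
        return response
```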


What is the most efficient solution?


(a) What I'm currently doing: using a proxymesh rotating Proxy through the http_proxy environment variable, which appears to rotate proxies often enough to at least fairly regularly get through the target site's redirections. (Downsides: the open proxies are slow to ping, there are only so many of them, proxymesh will eventually start charging me per gig past 10 gigs, I only need them to rotate when redirected, I don't know how often or on what trigger they rotate, and the above: I don't know if the pages I'm being redirected from are being re-queued by Scrapy...) (If Proxymesh is rotating on each request, then I'm okay with paying reasonable costs.)


(b) Would it make sense (and be simple) to use middleware to reselect a new proxy on each redirection? What about on every single request? Would that make more sense through something else like TOR or Proxifier? If this is relatively straightforward, how would I set it up? I've read something like this in a few places, but most are outdated with broken links or deprecated Scrapy commands.
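(Rotating on every single request is indeed straightforward in a downloader middleware. A minimal sketch, with a hypothetical `PROXY_LIST` of endpoints -- the URLs and credentials are placeholders, not real ProxyMesh values:)

```python
import random

# Hypothetical endpoints; substitute your own proxies, with any
# credentials embedded in the URL
PROXY_LIST = [
    'http://username:password@us-il.proxymesh.com:31280',
    'http://username:password@jp.proxymesh.com:31280',
]


class RandomProxyPerRequestMiddleware(object):
    """Pick a fresh proxy for every outgoing request."""

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(PROXY_LIST)
```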


For reference, I do have middleware currently set up for Proxy Mesh (yes, I'm using the http_proxy environment variable, but I'm a fan of redundancy when it comes to not getting in trouble). So this is what I have for that currently, in case that matters:

    import base64

    class ProxyMiddleware(object):
        def process_request(self, request, spider):
            request.meta['proxy'] = "http://open.proxymesh.com:[port number]"

            # b64encode, not the deprecated encodestring (which appends a
            # newline and corrupts the header value)
            proxy_user_pass = "username:password"
            encoded_user_pass = base64.b64encode(proxy_user_pass)
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
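(For the middleware to take effect it also has to be registered in `settings.py`. A sketch, assuming the class lives in a hypothetical `myproject.middlewares` module:)

```python
# settings.py -- the module path 'myproject.middlewares' is an assumption
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 100,
    # The built-in HttpProxyMiddleware (default priority 750) stays enabled
    # so that request.meta['proxy'] is actually applied.
}
```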

Answer


Yesterday I had a similar task involving proxies and anti-DDoS protection (I was parsing a site). The idea is `random.choice`: every request has a chance of changing its IP. Scrapy talks to Tor via `telnetlib`; you need to configure a ControlPort password for Tor.
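(The "chance" gate is just a uniform draw; a quick self-contained sketch of the idea -- the helper name is mine:)

```python
import random


def should_rotate(probability=0.15, rng=random):
    # True on roughly `probability` of the calls -- the same gate that
    # decides whether to ask Tor for a new circuit in the middleware below
    return rng.random() < probability


# With a seeded RNG the observed rate sits near 15%
rng = random.Random(42)
hits = sum(should_rotate(0.15, rng) for _ in range(10000))
```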

from scrapy import log
from settings import USER_AGENT_LIST

import random
import telnetlib
import time


# 15% chance of changing the Tor exit IP on each request
class RetryChangeProxyMiddleware(object):
    def process_request(self, request, spider):
        if random.choice(xrange(1, 100)) <= 15:
            log.msg('Changing proxy')
            # Talk to Tor's ControlPort and request a new circuit (NEWNYM)
            tn = telnetlib.Telnet('127.0.0.1', 9051)
            tn.read_until("Escape character is '^]'.", 2)
            tn.write('AUTHENTICATE "<PASSWORD HERE>"\r\n')
            tn.read_until("250 OK", 2)
            tn.write("SIGNAL NEWNYM\r\n")
            tn.read_until("250 OK", 2)
            tn.write("QUIT\r\n")
            tn.close()
            log.msg('>>>> Proxy changed. Sleep Time')
            time.sleep(10)  # give Tor time to build the new circuit


# 30% chance of changing the User-Agent on each request
class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        if random.choice(xrange(1, 100)) <= 30:
            log.msg('Changing UserAgent')
            ua = random.choice(USER_AGENT_LIST)
            if ua:
                request.headers.setdefault('User-Agent', ua)
            log.msg('>>>> UserAgent changed')
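(To wire this up, both classes go into `DOWNLOADER_MIDDLEWARES`, and the spider's traffic has to reach Tor through an HTTP front-end, since Scrapy speaks HTTP proxies rather than SOCKS -- a shim such as Privoxy or Polipo usually sits between Scrapy and Tor's SOCKS port. A sketch, with the module paths and the Privoxy port as assumptions:)

```python
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RetryChangeProxyMiddleware': 400,
    'myproject.middlewares.RandomUserAgentMiddleware': 401,
}

USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # trim/extend to taste
]

# Elsewhere (e.g. a proxy middleware), point requests at the HTTP shim
# that forwards to Tor's SOCKS port; Privoxy's default listen port is 8118:
# request.meta['proxy'] = 'http://127.0.0.1:8118'
```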
