Setting Scrapy proxy middleware to rotate on each request


Question

This question necessarily comes in two forms, because I don't know the better route to a solution.

A site I'm crawling kicks me to a redirected "User Blocked" page often, but the frequency (by requests/time) seems random, and they appear to have a blacklist blocking many of the "open" proxies list I'm using through Proxymesh. So...

  1. When Scrapy receives a "Redirect" to its request (e.g. DEBUG: Redirecting (302) to (GET http://.../you_got_blocked.aspx) from (GET http://.../page-544.htm)), does it continue to try to get to page-544.htm, or will it continue on to page-545.htm and forever lose out on page-544.htm? If it "forgets" (or counts it as visited), is there a way to tell it to keep retrying that page? (If it does that naturally, then yay, and good to know...)
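By default Scrapy follows the 302 and treats the redirect target as the result; the original page is not automatically re-queued, and the duplicate filter would normally drop a repeat request for it. A hedged sketch of a downloader middleware that re-queues such pages — the class name, the `BLOCKED_MARKER` string, and the retry cap are my assumptions, not anything from the question:

```python
# Sketch only: re-queue a request when the site redirected us to its
# block page. BLOCKED_MARKER and MAX_BLOCK_RETRIES are assumed values.
BLOCKED_MARKER = "you_got_blocked.aspx"
MAX_BLOCK_RETRIES = 3


class RetryBlockedMiddleware(object):
    """If the final response landed on the block page, hand a copy of
    the original request back to the scheduler instead."""

    def process_response(self, request, response, spider):
        if BLOCKED_MARKER in response.url:
            retries = request.meta.get("block_retries", 0)
            if retries < MAX_BLOCK_RETRIES:
                # dont_filter=True bypasses the duplicate filter, which
                # would otherwise drop the already-seen URL
                retry_req = request.replace(dont_filter=True)
                retry_req.meta["block_retries"] = retries + 1
                return retry_req
        return response
```

Returning a `Request` from `process_response` re-schedules it; returning the response lets it continue through the pipeline as usual.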

  2. What is the most efficient solution?

(a) What I'm currently doing: using a proxymesh rotating Proxy through the http_proxy environment variable, which appears to rotate proxies often enough to at least fairly regularly get through the target site's redirections. (Downsides: the open proxies are slow to ping, there are only so many of them, proxymesh will eventually start charging me per gig past 10 gigs, I only need them to rotate when redirected, I don't know how often or on what trigger they rotate, and the above: I don't know if the pages I'm being redirected from are being re-queued by Scrapy...) (If Proxymesh is rotating on each request, then I'm okay with paying reasonable costs.)

(b) Would it make sense (and be simple) to use middleware to reselect a new proxy on each redirection? What about on every single request? Would that make more sense through something else like TOR or Proxifier? If this is relatively straightforward, how would I set it up? I've read something like this in a few places, but most are outdated with broken links or deprecated Scrapy commands.
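Rotating on every single request takes only a few lines of downloader middleware. A minimal sketch — the class name and the proxy URLs are placeholders, not real endpoints:

```python
import random

# Placeholder endpoints -- substitute your own proxy list.
PROXY_LIST = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


class RandomProxyMiddleware(object):
    """Pick a fresh proxy for every outgoing request."""

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(PROXY_LIST)
```

Scrapy's built-in `HttpProxyMiddleware` then reads `request.meta['proxy']` and routes the request accordingly.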

For reference, I do have middleware currently set up for Proxy Mesh (yes, I'm using the http_proxy environment variable, but I'm a fan of redundancy when it comes to not getting in trouble). So this is what I have for that currently, in case that matters:

import base64


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://open.proxymesh.com:[port number]"

        # b64encode, unlike the deprecated encodestring, adds no trailing
        # newline (a stray newline would corrupt the header value)
        proxy_user_pass = "username:password"
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
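For completeness, a middleware like this only runs once it is registered in the project settings. A sketch, assuming the class lives in a `myproject.middlewares` module (the path and priority number are my assumptions):

```python
# settings.py -- "myproject.middlewares" is a placeholder module path.
# Priorities below 750 run before Scrapy's built-in HttpProxyMiddleware,
# the component that actually applies request.meta['proxy'].
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.ProxyMiddleware": 410,
}
```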

Answer

Yesterday I had a similar task with proxies and protection against DDoS (I was parsing a site). The idea is `random.choice`: every request has a chance of changing its IP. Scrapy works through Tor, and `telnetlib` is used to talk to Tor's control port; you need to configure a ControlPort password.

import logging
import random
import telnetlib
import time

from settings import USER_AGENT_LIST

logger = logging.getLogger(__name__)


# 15% chance of an IP change per request
class RetryChangeProxyMiddleware(object):
    def process_request(self, request, spider):
        if random.randint(1, 100) <= 15:
            logger.info('Changing proxy')
            # ask Tor's control port for a fresh circuit (NEWNYM);
            # under Python 3, telnetlib works on bytes
            tn = telnetlib.Telnet('127.0.0.1', 9051)
            tn.read_until(b"Escape character is '^]'.", 2)
            tn.write(b'AUTHENTICATE "<PASSWORD HERE>"\r\n')
            tn.read_until(b"250 OK", 2)
            tn.write(b"signal NEWNYM\r\n")
            tn.read_until(b"250 OK", 2)
            tn.write(b"quit\r\n")
            tn.close()
            logger.info('>>>> Proxy changed. Sleep Time')
            time.sleep(10)  # give Tor time to build the new circuit


# 30% chance of a User-Agent change per request
class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        if random.randint(1, 100) <= 30:
            logger.info('Changing UserAgent')
            ua = random.choice(USER_AGENT_LIST)
            if ua:
                # setdefault only sets the header if it is not already present
                request.headers.setdefault('User-Agent', ua)
            logger.info('>>>> UserAgent changed')
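One wiring detail the answer leaves implicit: `signal NEWNYM` changes Tor's exit circuit, but Scrapy still needs an HTTP proxy endpoint to send traffic through, and Tor itself speaks SOCKS, so a local HTTP frontend such as Privoxy (conventionally on port 8118) is commonly placed in front of it. A sketch of the settings, with placeholder module paths:

```python
# settings.py -- module paths are placeholders for wherever the two
# middleware classes above actually live.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.RetryChangeProxyMiddleware": 400,
    "myproject.middlewares.RandomUserAgentMiddleware": 401,
}

# Requests are then pointed at the local HTTP-to-SOCKS frontend, e.g.
# in a middleware or spider:
#     request.meta['proxy'] = 'http://127.0.0.1:8118'
```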

