Unable to use proxies one by one until there is a valid response


Problem description

I've written a script in Python's Scrapy that makes proxied requests using one of the proxies freshly generated by the get_proxies() method. I use the requests module to fetch the proxies so they can be reused in the script. However, the proxy my script picks may not always be a good one, so sometimes it fails to fetch a valid response.

How can I let my script keep trying different proxies until it gets a valid response?

My script so far:

import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.http.request import Request
from scrapy.crawler import CrawlerProcess

class ProxySpider(scrapy.Spider):
    name = "sslproxies"
    check_url = "https://stackoverflow.com/questions/tagged/web-scraping"
    proxy_link = "https://www.sslproxies.org/"

    def start_requests(self):
        proxylist = self.get_proxies()
        random.shuffle(proxylist)
        proxy_ip_port = next(cycle(proxylist))
        print(proxy_ip_port)       #Checking out the proxy address
        request = scrapy.Request(self.check_url, callback=self.parse, errback=self.errback_httpbin, dont_filter=True)
        request.meta['proxy'] = "http://{}".format(proxy_ip_port)
        yield request

    def get_proxies(self):
        response = requests.get(self.proxy_link)
        soup = BeautifulSoup(response.text, "lxml")
        proxy = [
            ':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text])
            for item in soup.select("table.table tbody tr") if "yes" in item.text
        ]
        return proxy

    def parse(self, response):
        print(response.meta.get("proxy"))  #Compare this to the earlier one whether they both are the same

    def errback_httpbin(self, failure):
        print("Failure: "+str(failure))

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0', 
        'DOWNLOAD_TIMEOUT' : 5,  
    })
    c.crawl(ProxySpider)
    c.start()

PS: My intention is to get a solution along the lines of the way I've started here.

Recommended answer

As we know, an HTTP response needs to pass through all the middlewares in order to reach the spider methods.

This means that only requests made with valid proxies can proceed to the spider callback functions.

In order to use valid proxies, we need to check ALL the proxies first and then choose only from the valid ones.

When our previously chosen proxy stops working, we mark it as not valid and pick a new one from the remaining valid proxies in the spider's errback.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.http.request import Request

class ProxySpider(scrapy.Spider):
    name = "sslproxies"
    check_url = "https://stackoverflow.com/questions/tagged/web-scraping"
    proxy_link = "https://www.sslproxies.org/"
    current_proxy = ""
    proxies = {}

    def start_requests(self):
        yield Request(self.proxy_link,callback=self.parse_proxies)

    def parse_proxies(self, response):
        # Collect the https-capable proxies (rows containing "yes") as "http://ip:port" keys,
        # all initially marked as not yet validated
        for row in response.css("table#proxylisttable tbody tr"):
            if "yes" in row.extract():
                td = row.css("td::text").extract()
                self.proxies["http://{}".format(td[0] + ":" + td[1])] = {"valid": False}

        # Probe every collected proxy against check_url; setting "download_slot" to the proxy
        # gives each proxy its own downloader slot, so the download delay applies per proxy
        for proxy in self.proxies.keys():
            yield Request(self.check_url, callback=self.parse, errback=self.errback_httpbin,
                          meta={"proxy": proxy,
                                "download_slot": proxy},
                          dont_filter=True)

    def parse(self, response):
        if "proxy" in response.request.meta.keys():
            #As script reaches this parse method we can mark current proxy as valid
            self.proxies[response.request.meta["proxy"]]["valid"] = True
            print(response.meta.get("proxy"))
            if not self.current_proxy:
                #Scraper reaches this code line on first valid response
                self.current_proxy = response.request.meta["proxy"]
                #yield Request(next_url, callback=self.parse_next,
                #              meta={"proxy":self.current_proxy,
                #                    "download_slot":self.current_proxy})

    def errback_httpbin(self, failure):
        if "proxy" in failure.request.meta.keys():
            proxy = failure.request.meta["proxy"]
            if proxy == self.current_proxy:
                #If current proxy after our usage becomes not valid
                #Mark it as not valid
                self.proxies[proxy]["valid"] = False
                for ip_port in self.proxies.keys():
                    #And choose valid proxy from self.proxies
                    if self.proxies[ip_port]["valid"]:
                        failure.request.meta["proxy"] = ip_port
                        failure.request.meta["download_slot"] = ip_port
                        self.current_proxy = ip_port
                        return failure.request
        print("Failure: "+str(failure))

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'COOKIES_ENABLED': False,
        'DOWNLOAD_TIMEOUT' : 10,
        'DOWNLOAD_DELAY' : 3,
    })
    c.crawl(ProxySpider)
    c.start()
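
For reference, the commented-out lines in parse show where the real crawl would continue once a valid proxy is known. Below is a minimal sketch of that continuation, with the two methods belonging inside ProxySpider; next_url and parse_next are hypothetical placeholders for the actual target page and its callback, and are not part of the original answer:

    def parse(self, response):
        if "proxy" in response.request.meta.keys():
            # Reaching this callback means the proxy worked, so mark it as valid
            self.proxies[response.request.meta["proxy"]]["valid"] = True
            if not self.current_proxy:
                # First valid response: keep this proxy and start the real crawl with it
                self.current_proxy = response.request.meta["proxy"]
                next_url = self.check_url  # placeholder: replace with your real target URL
                yield Request(next_url, callback=self.parse_next,
                              meta={"proxy": self.current_proxy,
                                    "download_slot": self.current_proxy},
                              dont_filter=True)

    def parse_next(self, response):
        # Hypothetical callback: parse the actual target page here
        print(response.url, response.status)

Keeping "download_slot" equal to the chosen proxy means the per-slot download delay keeps applying per proxy for the follow-up requests as well.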

