Can't get desired results using try/except clause within scrapy


Problem Description

I've written a script in scrapy to make proxied requests using newly generated proxies from the get_proxies() method. I used the requests module to fetch the proxies so they can be reused in the script. What I'm trying to do is parse all the movie links from its landing page and then fetch the name of each movie from its target page. The script below rotates through those proxies.

I know there is an easier way to change proxies, as described here: HttpProxyMiddleware, but I would still like to stick with the approach I'm trying here.
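For context, that built-in route boils down to letting HttpProxyMiddleware (which is enabled by default) read a proxy meta key on each request; a rough sketch of that idea, with a placeholder proxy address:

import scrapy

class MiddlewareProxySpider(scrapy.Spider):
    # Sketch only: the default HttpProxyMiddleware honours the 'proxy' meta key,
    # so each request can carry its own proxy without any custom handling.
    name = "middlewareproxy"
    start_url = "https://yts.am/browse-movies"

    def start_requests(self):
        yield scrapy.Request(
            self.start_url,
            callback=self.parse,
            meta={'proxy': 'http://142.93.127.126:3128'},  # placeholder proxy address
        )

    def parse(self, response):
        self.logger.info("status %s via %s", response.status, response.meta.get('proxy'))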

Website link

This is my current attempt (it keeps using new proxies to try to fetch a valid response, but every time it gets 503 Service Unavailable):

import scrapy
import random
import requests
from itertools import cycle
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

def get_proxies():
    # build a list of "ip:port" strings from the free proxy table
    response = requests.get("https://www.us-proxy.org/")
    soup = BeautifulSoup(response.text, "lxml")
    proxy = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tbody tr") if "yes" in item.text]
    return proxy

class ProxySpider(scrapy.Spider):
    name = "proxiedscript"
    handle_httpstatus_list = [503]
    proxy_vault = get_proxies()
    check_url = "https://yts.am/browse-movies"

    def start_requests(self):
        random.shuffle(self.proxy_vault)
        proxy_url = next(cycle(self.proxy_vault))
        request = scrapy.Request(self.check_url, callback=self.parse, dont_filter=True)
        request.meta['https_proxy'] = f'http://{proxy_url}'
        yield request

    def parse(self, response):
        print(response.meta)
        if "DDoS protection by Cloudflare" in response.css(".attribution > a::text").get():
            # still blocked: shuffle, pick another proxy and retry the landing page
            random.shuffle(self.proxy_vault)
            proxy_url = next(cycle(self.proxy_vault))
            request = scrapy.Request(self.check_url, callback=self.parse, dont_filter=True)
            request.meta['https_proxy'] = f'http://{proxy_url}'
            yield request
        else:
            for item in response.css(".browse-movie-wrap a.browse-movie-title::attr(href)").getall():
                nlink = response.urljoin(item)
                yield scrapy.Request(nlink, callback=self.parse_details)

    def parse_details(self, response):
        name = response.css("#movie-info h1::text").get()
        yield {"Name": name}

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    c.crawl(ProxySpider)
    c.start()

To check whether the request is actually being proxied, I printed response.meta and got results like this: {'https_proxy': 'http://142.93.127.126:3128', 'download_timeout': 180.0, 'download_slot': 'yts.am', 'download_latency': 0.237013578414917, 'retry_times': 2, 'depth': 0}.

Because I've overused the link while checking how proxied requests work within scrapy, I'm getting a 503 Service Unavailable error at the moment, and I can see the phrase DDoS protection by Cloudflare within the response. However, I do get a valid response when I apply the same logic with the requests module.
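For comparison, a rough sketch of what that requests-based check might look like (illustrative only; the helper name fetch_with_requests and the exact timeout are assumptions):

import random
import requests

def fetch_with_requests(url, proxy_pool):
    # try proxies one at a time until one of them returns a 200 response
    for proxy in random.sample(proxy_pool, len(proxy_pool)):
        try:
            response = requests.get(
                url,
                proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
                headers={"User-Agent": "Mozilla/5.0"},
                timeout=10,
            )
            if response.status_code == 200:
                return response
        except requests.exceptions.RequestException:
            continue  # this proxy failed; move on to the next one
    return None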

My earlier question: why can't I get a valid response when (I think) I'm using the proxies in the right way? [SOLVED]

Bounty question: how can I define a try/except clause within my script so that it will try a different proxy once it throws a connection error with a certain proxy?

Recommended Answer

According to the scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware docs (and source), the proxy meta key is the one expected to be used (not https_proxy):

#request.meta['https_proxy'] = f'http://{proxy_url}'  
request.meta['proxy'] = f'http://{proxy_url}'

Since scrapy never received a valid meta key, your scrapy application wasn't using proxies at all.
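As for the bounty question itself, Scrapy downloads happen asynchronously, so a plain try/except around the request won't catch connection errors; the usual substitute is a request errback. A rough sketch of that idea, assuming the corrected proxy meta key and a proxy_vault list like the one in the question (names such as build_request and retry_with_new_proxy are illustrative):

import random
import scrapy

class ProxyRetrySpider(scrapy.Spider):
    name = "proxyretry"
    check_url = "https://yts.am/browse-movies"
    proxy_vault = ["142.93.127.126:3128"]  # in practice, fill this via get_proxies() as in the question

    def start_requests(self):
        yield self.build_request(self.check_url)

    def build_request(self, url):
        # attach a randomly chosen proxy with the standard 'proxy' meta key
        proxy_url = random.choice(self.proxy_vault)
        return scrapy.Request(
            url,
            callback=self.parse,
            errback=self.retry_with_new_proxy,  # fires on connection errors, timeouts, etc.
            meta={'proxy': f'http://{proxy_url}'},
            dont_filter=True,
        )

    def retry_with_new_proxy(self, failure):
        # discard the proxy that just failed and retry the same URL with another one
        bad_proxy = failure.request.meta.get('proxy', '').replace('http://', '')
        if bad_proxy in self.proxy_vault:
            self.proxy_vault.remove(bad_proxy)
        if self.proxy_vault:
            yield self.build_request(failure.request.url)

    def parse(self, response):
        self.logger.info("Fetched %s via %s", response.url, response.meta.get('proxy'))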
