Scrapy error: User timeout caused connection failure


Problem description


I'm using Scrapy to scrape the adidas site http://www.adidas.com/us/men-shoes, but it always fails with this error:

User timeout caused connection failure: Getting http://www.adidas.com/us/men-shoes took longer than 180.0 seconds..

It retries 5 times and then fails completely.
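The 180.0 seconds in that message is Scrapy's default DOWNLOAD_TIMEOUT, and the number of retries is governed by the RETRY_TIMES setting; while debugging a hang like this, both can be lowered so failures surface faster. A minimal settings.py sketch:

# settings.py -- fail faster while debugging
# (Scrapy's default DOWNLOAD_TIMEOUT is 180 seconds)
DOWNLOAD_TIMEOUT = 30
RETRY_TIMES = 1  # retry once instead of the configured default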

I can access the URL in Chrome, but it doesn't work in Scrapy.
I've tried using custom user agents and emulating browser request headers, but it still doesn't work.

Below is my code:

import scrapy


class AdidasSpider(scrapy.Spider):
    name = "adidas"

    def start_requests(self):

        urls = ['http://www.adidas.com/us/men-shoes']

        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "en-US,en;q=0.9",
            "Cache-Control": "max-age=0",
            "Connection": "keep-alive",
            "Host": "www.adidas.com",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        }

        for url in urls:
            yield scrapy.Request(url, self.parse, headers=headers)

    def parse(self, response):
        # a callback must yield items/requests, not raw bytes
        yield {"body": response.body}

Scrapy log:

{'downloader/exception_count': 1,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'downloader/request_bytes': 224,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'finish_reason': 'shutdown',
 'finish_time': datetime.datetime(2018, 1, 25, 10, 59, 35, 57000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 9,
 'retry/count': 1,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 1, 25, 10, 58, 39, 550000)}
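
The twisted.web._newclient.ResponseNeverReceived entries in those stats mean the request was sent but no HTTP response ever came back before the timeout, which is consistent with the diagnosis in the update below.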

Update

After looking at the request headers using Fiddler and doing some tests, I found what was causing the issue: Scrapy sends a Connection: close header by default, which is why I'm not getting any response from the adidas site.

After testing in Fiddler by making the same request without the Connection: close header, I got the response correctly. Now the problem is: how do I remove the Connection: close header?
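For reference, the obvious first attempt is to override the header on the request itself; a minimal sketch of that attempt follows, but it does not work here because, as noted in the solution below, the header is hardcoded in the library:

# A sketch of the naive fix -- it does NOT work: the Connection: close
# header is hardcoded upstream, and this per-request value is ignored.
yield scrapy.Request(url, callback=self.parse,
                     headers={"Connection": "keep-alive"})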

Solution

As Scrapy doesn't let you edit the Connection: close header, I used scrapy-splash instead to make the requests through Splash.

Now the Connection: close header can be overridden and everything works. The downside is that the web page has to load all its assets before Splash returns the response; it's slower, but it works.

Scrapy should add an option to edit its default Connection: close header; it is hardcoded in the library and cannot be overridden easily.
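For context, scrapy-splash has to be wired into the project before SplashRequest can be used; the settings below follow the scrapy-splash README, with SPLASH_URL assuming a Splash instance running locally:

SPLASH_URL = 'http://localhost:8050'  # assumes a local Splash instance

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'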

Below is my working code:

import scrapy
from scrapy_splash import SplashRequest


class AdidasSpider(scrapy.Spider):
    name = "adidas"

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Host": "www.adidas.com",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }

    def start_requests(self):
        url = "http://www.adidas.com/us/men-shoes?sz=120&start=0"
        # render through Splash so the headers above are actually sent
        yield SplashRequest(url, self.parse, headers=self.headers)

    def parse(self, response):
        self.logger.info("received %d bytes", len(response.body))
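
Note that this only works if a Splash instance is reachable at the configured SPLASH_URL; the usual setup is the scrapinghub/splash Docker image, which listens on port 8050 by default.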
