Scrapy error: User timeout caused connection failure

Problem description
I'm using Scrapy to scrape the adidas site: http://www.adidas.com/us/men-shoes, but it always shows this error:

User timeout caused connection failure: Getting http://www.adidas.com/us/men-shoes took longer than 180.0 seconds..

It retries 5 times and then fails completely. I can access the URL in Chrome, but it's not working in Scrapy. I've tried using custom user agents and emulating browser request headers, but it still doesn't work.
Below is my code:
import scrapy

class AdidasSpider(scrapy.Spider):
    name = "adidas"

    def start_requests(self):
        urls = ['http://www.adidas.com/us/men-shoes']
        headers = {
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "en-US,en;q=0.9",
            "Cache-Control": "max-age=0",
            "Connection": "keep-alive",
            "Host": "www.adidas.com",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        }
        for url in urls:
            yield scrapy.Request(url, self.parse, headers=headers)

    def parse(self, response):
        yield response.body
Scrapy log:
{'downloader/exception_count': 1,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
'downloader/request_bytes': 224,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2018, 1, 25, 10, 59, 35, 57000),
'log_count/DEBUG': 2,
'log_count/INFO': 9,
'retry/count': 1,
'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 1, 25, 10, 58, 39, 550000)}
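As an aside, the 180.0 seconds in the error message matches Scrapy's default DOWNLOAD_TIMEOUT, and the retry behaviour is controlled by RETRY_TIMES. Tightening them only makes the failure surface faster; a sketch with illustrative values, not a fix for the underlying problem:

```python
# Per-spider settings sketch; the values 30 and 2 are illustrative.
# DOWNLOAD_TIMEOUT defaults to 180 seconds; RETRY_TIMES sets how many
# extra attempts are made after the first failed request. Neither
# removes the Connection: close header; they only bound the wait.
custom_settings = {
    "DOWNLOAD_TIMEOUT": 30,
    "RETRY_TIMES": 2,
}
```

In a spider, this dict is assigned to the class-level custom_settings attribute.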
Update
After looking at the request headers with fiddler and doing some tests, I found what was causing the issue. Scrapy sends a Connection: close header by default, because of which I wasn't getting any response from the adidas site.

After testing in fiddler by making the same request but without the Connection: close header, I got the response correctly. Now the problem is how to remove the Connection: close header.
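The fiddler experiment can be approximated with only the standard library: http.client sends no Connection header of its own (unlike urllib.request, which hardcodes Connection: close much as Scrapy does), so HTTP/1.1 implicit keep-alive applies. A sketch, with the live request left commented out to keep it offline:

```python
import http.client

# Same kind of headers as the successful fiddler test; note there is
# no "Connection" entry, so the HTTP/1.1 default (keep-alive) applies.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/63.0.3239.132 Safari/537.36",
}

assert "Connection" not in headers

# Live request, commented out to avoid a network call here:
# conn = http.client.HTTPConnection("www.adidas.com", timeout=30)
# conn.request("GET", "/us/men-shoes", headers=headers)
# print(conn.getresponse().status)
```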
As Scrapy doesn't let you edit the Connection: close header, I used scrapy-splash instead to make the requests through Splash. Now the Connection: close header can be overridden and everything works. The downside is that the web page now has to load all its assets before I get the response from Splash; it's slower, but it works.

Scrapy should add an option to edit its default Connection: close header. It is hardcoded in the library and cannot be overridden easily.
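For anyone reproducing this: scrapy-splash needs a running Splash instance plus a few project settings. The fragment below follows the scrapy-splash README; the localhost URL is an assumption about your setup (e.g. Splash started with docker run -p 8050:8050 scrapinghub/splash):

```python
# settings.py fragment for scrapy-splash (per its README).
# Assumes a Splash instance is reachable at the URL below.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```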
Below is my working code:
import scrapy
from scrapy_splash import SplashRequest

class AdidasSpider(scrapy.Spider):
    name = "adidas"
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Host": "www.adidas.com",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }

    def start_requests(self):
        url = "http://www.adidas.com/us/men-shoes?sz=120&start=0"
        yield SplashRequest(url, self.parse, headers=self.headers)