Scrapy twisted connection lost in non-clean fashion. No proxy. Already tried headers


Problem description

I am trying to crawl this site:

https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs

with Scrapy, and I keep getting Twisted connection-lost errors. I am not using a proxy. I tried setting the user agent, and then setting all of the request headers based on this answer, with no change.

Here is the code generating the request:

from scrapy import Request

# Method of the spider class
def start_requests(self):
    url = 'https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs'

    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.8',
        'Connection': 'keep-alive',
        'DNT': '1',
        'Host': 'www5.apply2jobs.com',
        'Referer': 'https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=2524&CurrentPage=2',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'
    }

    yield Request(url=url, headers=headers, callback=self.parse)

And this is the log output:

2017-08-28 13:34:13 [scrapy.core.engine] INFO: Spider opened
2017-08-28 13:34:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-28 13:34:13 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www5.apply2jobs.com/robots.txt> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www5.apply2jobs.com/robots.txt> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www5.apply2jobs.com/robots.txt> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://www5.apply2jobs.com/robots.txt>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (failed 1 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (failed 2 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.core.scraper] ERROR: Error downloading <GET https://www5.apply2jobs.com/jupitermed/ProfExt/index.cfm?fuseaction=mExternal.searchJobs>: [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]
2017-08-28 13:34:13 [scrapy.core.engine] INFO: Closing spider (finished)

Solution

Thanks to the discussion on GitHub and the comments on my question, it looks like the best course of action is to run the spider in a virtualenv with cryptography<2.

Credit to @paultrmbrth for helping so much. Quoting from that discussion:

"I tried compiling OpenSSL 1.1.0f with enable-weak-ssl-ciphers to build a static wheel, but for some reason I didn't manage to have it support TLS_RSA_WITH_RC4_128_MD5 (as ssllabs.com reports). I'm apparently lacking OpenSSL building knowledge. So the only option I see is to use a virtualenv with 'cryptography<2' for scraping that website."
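One way to confirm that the spider really runs under the pinned release is to check the installed cryptography version from inside the virtualenv. The following is only a minimal sketch: the helper names (major_version, is_pinned_below_2, check_environment) and the environment name legacy-env are made up for illustration, and the setup commands in the comments assume a Python version for which a cryptography<2 release can still be installed.

```python
"""Check whether the active environment satisfies the 'cryptography<2' pin.

Assumed setup, run once in a shell (the environment name is hypothetical):
    python -m venv legacy-env
    legacy-env/bin/pip install 'cryptography<2' scrapy
"""
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading numeric component of a version string such as '1.9'."""
    return int(version.split(".")[0])


def is_pinned_below_2(version: str) -> bool:
    """Return True when the version satisfies the 'cryptography<2' constraint."""
    return major_version(version) < 2


def check_environment() -> str:
    """Describe the cryptography install visible to the current interpreter."""
    try:
        installed = metadata.version("cryptography")
    except metadata.PackageNotFoundError:
        return "cryptography is not installed in this environment"
    if is_pinned_below_2(installed):
        return f"cryptography {installed}: satisfies 'cryptography<2'"
    return f"cryptography {installed}: does not satisfy 'cryptography<2'"


if __name__ == "__main__":
    print(check_environment())
```

Running this with the virtualenv's interpreter before starting the crawl shows whether the pin took effect. It only inspects package metadata, so it does not prove that the server's cipher suite is actually negotiated.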
