Portia Spider logs showing ['partial'] during crawling

Views: 54 | Published: 2021/7/16 21:53:11 | Tags: python web-scraping scrapy scrapyd portia


Problem description


I created a spider with the Portia web scraper. The start URL is

https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs

When I schedule this spider in scrapyd, the log shows

DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs> (referer: None) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=2> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']
DEBUG: Crawled (200) <GET https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.showJob&RID=21805&CurrentPage=1> (referer: https://www1.apply2jobs.com/EdwardJonesCareers/ProfExt/index.cfm?fuseaction=mExternal.searchJobs) ['partial']

What does ['partial'] mean, and why is the page content not being scraped by the spider?

Solution

Late answer, but hopefully still useful, since this Scrapy behavior doesn't seem to be well documented. Looking at the relevant line of code in the Scrapy source, the partial flag is set on a response when the download encounters a Twisted PotentialDataLoss error. According to the corresponding Twisted documentation:

This only occurs when making requests to HTTP servers which do not set Content-Length or a Transfer-Encoding in the response

Possible causes include:

The server is misconfigured
There's a proxy involved that's blocking some headers
You get a response that doesn't normally have Content-Length, e.g. a redirect (301, 302, 303), but you've set handle_httpstatus_list or handle_httpstatus_all, so the response is neither filtered out by HttpErrorMiddleware nor followed by RedirectMiddleware
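
To narrow down which of these causes applies, it can help to log the status and headers of any response that carries the flag. The following is only a minimal sketch: the helper names `lacks_length_framing` and `describe_partial` are hypothetical, and the code assumes a response object that exposes `url`, `status`, `flags`, `headers` and `body`, as Scrapy responses do. A stand-in object is used here so the snippet runs without Scrapy installed.

```python
from types import SimpleNamespace


def _header_name(name):
    """Normalize a header name (str or bytes) to lowercase text."""
    if isinstance(name, bytes):
        name = name.decode("latin-1")
    return str(name).lower()


def lacks_length_framing(headers):
    """True if neither Content-Length nor Transfer-Encoding is present."""
    names = {_header_name(name) for name in headers}
    return "content-length" not in names and "transfer-encoding" not in names


def describe_partial(response):
    """Return a diagnostic message for a 'partial' response, else None."""
    flags = getattr(response, "flags", None) or []
    if "partial" not in flags:
        return None
    body = getattr(response, "body", b"") or b""
    unframed = lacks_length_framing(response.headers)
    return (
        f"partial response: url={response.url} status={response.status} "
        f"bytes={len(body)} missing_length_headers={unframed}"
    )


# Stand-in object for demonstration (hypothetical values); in a spider this
# would be the response passed to the callback.
fake = SimpleNamespace(
    url="https://example.com/jobs",
    status=200,
    flags=["partial"],
    headers={"Content-Type": "text/html"},
    body=b"<html></html>",
)
print(describe_partial(fake))
```

In a spider callback, something like `msg = describe_partial(response)` followed by `self.logger.warning(msg)` when `msg` is not None would surface the same information in the crawl log. An unexpected status code or a very small body in that message points toward the redirect or proxy explanations rather than a parsing problem in the spider.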
