Scrapy not crawling all the pages

Problem description

I am trying to crawl a site in a very basic manner, but Scrapy isn't crawling all the links. I will explain the scenario as follows -

main_page.html -> contains links to a_page.html, b_page.html, c_page.html
a_page.html -> contains links to a1_page.html, a2_page.html
b_page.html -> contains links to b1_page.html, b2_page.html
c_page.html -> contains links to c1_page.html, c2_page.html
a1_page.html -> contains link to b_page.html
a2_page.html -> contains link to c_page.html
b1_page.html -> contains link to a_page.html
b2_page.html -> contains link to c_page.html
c1_page.html -> contains link to a_page.html
c2_page.html -> contains link to main_page.html

I am using the following rule in CrawlSpider -

Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)
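
For context, a minimal CrawlSpider wired up with that rule might look like the sketch below (Scrapy 0.14-era imports); the spider name, allowed domain and start URL are placeholders for illustration, not taken from the question.

# A rough sketch of the kind of spider described above; names and URLs are assumptions.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider(CrawlSpider):
    name = 'test_spider'
    allowed_domains = ['localhost']
    start_urls = ['http://localhost/main_page.html']

    # Follow every extracted link and run parse_item on each crawled page.
    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Placeholder callback: just log which page was crawled.
        self.log('Crawled %s' % response.url)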

But the results of the crawl are as follows -

DEBUG: Crawled (200) <http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)

It is not crawling all the pages.

NB - I have configured the crawl to run in BFO (breadth-first order), as indicated in the Scrapy docs.
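
For reference, breadth-first order is normally switched on through scheduler-related settings along these lines; the module paths below are the ones documented for recent Scrapy releases, so treat this as a sketch rather than the exact 0.14 configuration.

# settings.py - crawl in breadth-first order (BFO).
# Paths are from current Scrapy docs; older versions used different module names.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'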

What am I missing?

Answer

I had a similar problem today, although I was using a custom spider. It turned out that the website was limiting my crawl because my user agent was scrapy-bot.

Try changing your user agent and try again. Change it to that of a known browser, for example.
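
A simple way to do that is to override USER_AGENT in the project's settings.py; the string below is just an example of a browser-style user agent, not something prescribed by the answer.

# settings.py - present a browser-like user agent instead of Scrapy's default.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36'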

Another thing you might want to try is adding a delay. Some websites prevent scraping if the time between requests is too small. Try adding a DOWNLOAD_DELAY of 2 and see if that helps.
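
In settings.py that would look something like this, using the 2-second delay the answer suggests.

# settings.py - wait 2 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2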

More information about DOWNLOAD_DELAY is available at http://doc.scrapy.org/en/0.14/topics/settings.html
