Scrapy not crawling all the pages

Problem description

I am trying to crawl a site in a very basic manner, but Scrapy isn't crawling all the links. I will explain the scenario as follows -

main_page.html -> contains links to a_page.html, b_page.html, c_page.html
a_page.html -> contains links to a1_page.html, a2_page.html
b_page.html -> contains links to b1_page.html, b2_page.html
c_page.html -> contains links to c1_page.html, c2_page.html
a1_page.html -> contains link to b_page.html
a2_page.html -> contains link to c_page.html
b1_page.html -> contains link to a_page.html
b2_page.html -> contains link to c_page.html
c1_page.html -> contains link to a_page.html
c2_page.html -> contains link to main_page.html

I am using the following rule in CrawlSpider -

Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)
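
For context, a minimal CrawlSpider wired up with that rule might look like the sketch below (Scrapy 0.14-era imports); the spider name, allowed domain and start URL are placeholders for illustration, not taken from the question.

# A rough sketch of the kind of spider described above; names and URLs are assumptions.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider(CrawlSpider):
    name = 'test_spider'
    allowed_domains = ['localhost']
    start_urls = ['http://localhost/main_page.html']

    # Follow every extracted link and run parse_item on each crawled page.
    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Placeholder callback: just log which page was crawled.
        self.log('Crawled %s' % response.url)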

But the results of the crawl are as follows -

DEBUG: Crawled (200) <http://localhost/main_page.html> (referer: None)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a_page.html> (referer: http://localhost/main_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/a1_page.html> (referer: http://localhost/a_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b_page.html> (referer: http://localhost/a1_page.html)
2011-12-05 09:56:07+0530 [test_spider] DEBUG: Crawled (200) <http://localhost/b1_page.html> (referer: http://localhost/b_page.html)
2011-12-05 09:56:07+0530 [test_spider] INFO: Closing spider (finished)

It is not crawling all the pages.

NB - I have configured the crawl to run in BFO (breadth-first order), as indicated in the Scrapy docs.
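
For reference, breadth-first order is normally switched on through scheduler-related settings along these lines; the module paths below are the ones documented for recent Scrapy releases, so treat this as a sketch rather than the exact 0.14 configuration.

# settings.py - crawl in breadth-first order (BFO).
# Paths are from current Scrapy docs; older versions used different module names.
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'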

What am I missing?

Answer

I had a similar problem today, although I was using a custom spider. It turned out that the website was limiting my crawl because my user agent was scrapy-bot.

Try changing your user agent and try again. Change it to that of a known browser, for example.
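
A simple way to do that is to override USER_AGENT in the project's settings.py; the string below is just an example of a browser-style user agent, not something prescribed by the answer.

# settings.py - present a browser-like user agent instead of Scrapy's default.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36'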

Another thing you might want to try is adding a delay. Some websites prevent scraping if the time between requests is too small. Try adding a DOWNLOAD_DELAY of 2 and see if that helps.
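
In settings.py that would look something like this, using the 2-second delay the answer suggests.

# settings.py - wait 2 seconds between consecutive requests to the same site.
DOWNLOAD_DELAY = 2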

More information about DOWNLOAD_DELAY is available at http://doc.scrapy.org/en/0.14/topics/settings.html
