Order of crawling in Scrapy


Question

I wrote a basic CrawlSpider in Scrapy, but I want to understand the order in which the URLs are crawled - FIFO or LIFO?

I want the crawler to crawl all the links on the start URL page first and only then move on to other URLs, which does not seem to be the order it currently follows. How can I do this?

Answer

By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO (depth-first) order. This order is more convenient in most cases. If you do want to crawl in true BFO (breadth-first) order, you can do it with the following settings:
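The effect of the queue discipline can be illustrated with a toy frontier in plain Python (a sketch of the general idea only, not Scrapy's actual scheduler; the page names are hypothetical):

```python
from collections import deque

# Toy link graph: page -> links found on that page (hypothetical URLs).
graph = {
    "start": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [], "a2": [], "b1": [],
}

def crawl(lifo: bool):
    """Return the visit order using a LIFO (depth-first)
    or FIFO (breadth-first) frontier."""
    frontier = deque(["start"])
    seen = {"start"}
    order = []
    while frontier:
        # LIFO pops the most recently discovered page; FIFO pops the oldest.
        page = frontier.pop() if lifo else frontier.popleft()
        order.append(page)
        for link in graph[page]:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

print(crawl(lifo=True))   # depth-first-ish: start, b, b1, a, a2, a1
print(crawl(lifo=False))  # breadth-first: start, a, b, a1, a2, b1
```

With the LIFO frontier the crawler dives into the last page it discovered before finishing the start page's siblings; with FIFO it exhausts each depth level first, which is the behaviour the question asks for.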

 DEPTH_PRIORITY = 1
 SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
 SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

(Note: in recent Scrapy versions the queue classes live in the `scrapy.squeues` module, so the paths would be `'scrapy.squeues.PickleFifoDiskQueue'` and `'scrapy.squeues.FifoMemoryQueue'`.)
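To see why a positive `DEPTH_PRIORITY` helps, note that Scrapy's depth middleware lowers a request's priority as its depth grows, so shallower requests are scheduled first. Here is a toy model of that idea in plain Python (not Scrapy's real scheduler; the URLs are hypothetical):

```python
import heapq
import itertools

# Toy priority queue: higher priority is dequeued first, ties break FIFO.
counter = itertools.count()
heap = []

def push(url, depth, depth_priority=1):
    # Mimics the DEPTH_PRIORITY idea: priority drops by depth * setting,
    # so with a positive setting, shallower requests win.
    priority = -depth * depth_priority
    # heapq is a min-heap, so negate priority; the counter keeps FIFO ties.
    heapq.heappush(heap, (-priority, next(counter), url))

push("start", depth=0)
push("deep-page", depth=2)
push("shallow-page", depth=1)

order = [heapq.heappop(heap)[2] for _ in range(len(heap))]
print(order)  # ['start', 'shallow-page', 'deep-page']
```

Combined with the FIFO queues above, this depth-based priority yields the breadth-first order the question asks for.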

