Scrapy, limit on start_url
Question
I am wondering whether there is a limit on the number of start_urls I can assign to my spider? As far as I've searched, there seems to be no documentation on the limit of the list.
Currently I have set up my spider so that the list of start_urls is read in from a csv file. The number of urls is around 1,000,000.
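As a minimal sketch of the csv-reading step (the single-column layout and the helper name `load_start_urls` are assumptions, not part of the original question):

```python
import csv
import io

def load_start_urls(fileobj):
    """Read one url per row from a csv file object (first column only)."""
    return [row[0] for row in csv.reader(fileobj) if row]

# Usage with an in-memory file standing in for the real csv:
sample = io.StringIO("http://example.com/1\nhttp://example.com/2\n")
print(load_start_urls(sample))  # → ['http://example.com/1', 'http://example.com/2']
```

With ~1,000,000 rows this list itself is cheap; as the answer below explains, the cost comes from turning every url into a scheduled Request at once.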
Answer
There isn't a limit per se, but you probably want to limit it yourself, otherwise you might end up with memory problems. What can happen is that all 1M urls will be scheduled to the Scrapy scheduler, and since Python objects are quite a bit heavier than plain strings, you'll end up running out of memory.
To avoid this you can batch your start urls with the spider_idle signal:
import logging

from scrapy import Request, Spider, signals
from scrapy.exceptions import DontCloseSpider


class MySpider(Spider):
    name = "spider"
    batch_size = 10000

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = cls(crawler, *args, **kwargs)
        # Fire idle_consume whenever the spider runs out of requests
        crawler.signals.connect(spider.idle_consume, signals.spider_idle)
        return spider

    def __init__(self, crawler):
        self.crawler = crawler
        self.urls = []  # read from file

    def start_requests(self):
        # Take at most batch_size urls off the front of the buffer,
        # so an almost-empty buffer doesn't raise an IndexError
        batch = self.urls[:self.batch_size]
        self.urls = self.urls[self.batch_size:]
        for url in batch:
            yield Request(url)

    def parse(self, response):
        pass

    def idle_consume(self):
        """
        Every time the spider is about to close, check our url
        buffer to see if we have something left to crawl.
        """
        # Check the buffer itself: start_requests() returns a
        # generator, which is truthy even when it yields nothing
        if not self.urls:
            return
        logging.info('Consuming batch')
        for req in self.start_requests():
            self.crawler.engine.schedule(req, self)
        raise DontCloseSpider
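The batching idea in start_requests above can be shown in isolation (plain Python, no Scrapy; the helper name `take_batch` is mine, not from the answer): each call consumes at most batch_size urls from the front of the buffer until it is empty.

```python
# Sketch of the batching logic: slice off batch_size urls at a
# time, leaving the remainder for the next spider_idle cycle.
def take_batch(urls, batch_size):
    return urls[:batch_size], urls[batch_size:]

urls = [f"http://example.com/{i}" for i in range(25)]
batches = []
while urls:
    batch, urls = take_batch(urls, 10)
    batches.append(batch)

# 25 urls with batch_size=10 → batches of 10, 10, and 5
print([len(b) for b in batches])  # → [10, 10, 5]
```

This keeps only one batch's worth of Request objects in the scheduler at any time, which is what avoids the memory blow-up described above.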