How "download_slot" works within scrapy


Question

I've created a script in scrapy to parse the author name of different posts from its landing page and then pass it to the parse_page method using the meta keyword, in order to print the post content along with the author name at the same time.

I've used download_slot within the meta keyword, which allegedly makes the script run faster. Although it is not necessary to comply with the logic I tried to apply here, I would like to stick to it only to understand how download_slot works within any script and why. I searched a lot to learn more about download_slot, but I ended up with links like this one.

An example usage of download_slot (I'm not quite sure about it though):

from scrapy.crawler import CrawlerProcess
from scrapy import Request
import scrapy

class ConventionSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['https://stackoverflow.com/questions/tagged/web-scraping']

    def parse(self, response):
        # Grab each post summary on the landing page and pull out the
        # author name and the link to the full question.
        for link in response.css('.summary'):
            name = link.css('.user-details a::text').extract_first()
            url = link.css('.question-hyperlink::attr(href)').extract_first()
            nurl = response.urljoin(url)
            # Carry the author name along in meta; setting "download_slot"
            # also tells the downloader which slot to queue this request in.
            yield Request(nurl, callback=self.parse_page,
                          meta={'item': name, "download_slot": name})

    def parse_page(self, response):
        elem = response.meta.get("item")
        post = ' '.join(response.css("#question .post-text p::text").extract())
        yield {'Name': elem, 'Main_Content': post}

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    process.crawl(ConventionSpider)
    process.start()

The above script runs flawlessly.

My question: how does download_slot work within scrapy?

Answer

Let's start with the Scrapy architecture. When you create a scrapy.Request, the Scrapy engine passes it to the downloader to fetch the content. The downloader puts incoming requests into slots, which you can imagine as independent queues of requests. The queues are then polled and each individual request gets processed (its content gets downloaded).

Now, here's the crucial part. To determine which slot to put an incoming request into, the downloader checks request.meta for the download_slot key. If it is present, it puts the request into the slot with that name (and creates the slot if it doesn't exist yet). If the download_slot key is not present, it puts the request into the slot for the domain (more accurately, the hostname) that the request's URL points to.
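
A minimal sketch of that selection logic (simplified from what the downloader does; FakeRequest and get_slot_key here are illustrative stand-ins, not Scrapy's actual API):

from urllib.parse import urlparse

class FakeRequest:
    """Illustrative stand-in for scrapy.Request, just enough for this sketch."""
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta or {}

def get_slot_key(request):
    # An explicit "download_slot" in meta wins ...
    if "download_slot" in request.meta:
        return request.meta["download_slot"]
    # ... otherwise the slot is named after the URL's hostname.
    return urlparse(request.url).hostname or ""

print(get_slot_key(FakeRequest("https://stackoverflow.com/q/1")))
# stackoverflow.com
print(get_slot_key(FakeRequest("https://stackoverflow.com/q/2",
                               meta={"download_slot": "some-author"})))
# some-author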

This explains why your script runs faster: you create multiple downloader slots because they are named after the authors. If you did not, every request would go into the same slot, based on the domain (which is always stackoverflow.com). Thus, you effectively increase the parallelism of downloading the content.
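
This matters because throttling settings such as DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN are enforced per slot, not globally. A hedged sketch of how that plays out with the spider above (the values are illustrative, not recommendations):

from scrapy.crawler import CrawlerProcess

# With per-author slots as in the spider above, the delay and concurrency
# caps below apply to each author's queue separately, instead of
# serializing every request behind a single "stackoverflow.com" slot.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'DOWNLOAD_DELAY': 1,                   # wait ~1s between requests, per slot
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,   # max in-flight requests, per slot
})
process.crawl(ConventionSpider)  # the spider defined earlier
process.start()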

This explanation is a little simplified, but it should give you a picture of what's going on. You can check the Scrapy source code yourself.

