How many items have been scraped per start_url

Question

I use Scrapy to crawl 1000 URLs and store the scraped items in MongoDB. I'd like to know how many items were found for each URL. From the Scrapy stats I can see 'item_scraped_count': 3500. However, I need this count for each start_url separately. Each item also has a referer field that I could use to count the items per URL manually:

2016-05-24 15:15:10 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=6w-_ucPV674> (referer: https://www.youtube.com/results?q=billys&sp=EgQIAhAB)

But I wonder whether Scrapy has built-in support for this.

Recommended answer

Challenge accepted!

Nothing in Scrapy supports this directly, but you can keep the counting out of your spider code by putting it in a Spider Middleware:

middlewares.py

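from scrapy.http.request import Request


class StartRequestsCountMiddleware(object):
    """Counts scraped items per start request and records them in the crawl stats."""

    def __init__(self):
        # Maps the index of each start request to its URL.
        # Kept per instance so the dict is not shared between middleware objects.
        self.start_urls = {}

    def process_start_requests(self, start_requests, spider):
        # Tag every start request with its index so later responses can be traced back.
        for i, request in enumerate(start_requests):
            self.start_urls[i] = request.url
            request.meta.update(start_request_index=i)
            yield request

    def process_spider_output(self, response, result, spider):
        for output in result:
            if isinstance(output, Request):
                # Follow-up request: pass the start request index along.
                output.meta.update(
                    start_request_index=response.meta['start_request_index'],
                )
            else:
                # Anything else is treated as an item: count it under its start URL.
                spider.crawler.stats.inc_value(
                    'start_requests/item_scraped_count/{}'.format(
                        self.start_urls[response.meta['start_request_index']],
                    ),
                )
            yield output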
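
To see why passing the index along in `meta` is enough to attribute every item to its start URL, here is a minimal sketch that mimics the middleware's bookkeeping with plain Python objects instead of Scrapy. `FakeRequest`, `tag_start_requests`, and `count_outputs` are hypothetical names used only for this illustration, the example.com URLs are made up, and plain dicts stand in for items.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a Scrapy request: a URL plus a meta dict."""
    url: str
    meta: dict = field(default_factory=dict)


def tag_start_requests(requests, start_urls):
    """Mirrors process_start_requests: remember each URL and tag its request."""
    for i, request in enumerate(requests):
        start_urls[i] = request.url
        request.meta["start_request_index"] = i
        yield request


def count_outputs(parent_meta, outputs, start_urls, counts):
    """Mirrors process_spider_output: pass the tag to requests, count items."""
    index = parent_meta["start_request_index"]
    for output in outputs:
        if isinstance(output, FakeRequest):
            output.meta["start_request_index"] = index
        else:
            counts[start_urls[index]] += 1
        yield output


start_urls, counts = {}, Counter()
a, b = tag_start_requests(
    [FakeRequest("https://example.com/a"), FakeRequest("https://example.com/b")],
    start_urls,
)

# Start page "a" yields one item and one follow-up request.
_, follow_up = count_outputs(
    a.meta, [{"title": "x"}, FakeRequest("https://example.com/a/2")], start_urls, counts
)
# The follow-up page yields two more items, still attributed to "a".
list(count_outputs(follow_up.meta, [{"title": "y"}, {"title": "z"}], start_urls, counts))
# Start page "b" yields a single item.
list(count_outputs(b.meta, [{"title": "w"}], start_urls, counts))

print(dict(counts))
# {'https://example.com/a': 3, 'https://example.com/b': 1}
```

Items found two levels deep are still counted under the original start URL, which is what the referer field alone would not give you.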

Remember to activate it in settings.py:

SPIDER_MIDDLEWARES = {
    ...
    'myproject.middlewares.StartRequestsCountMiddleware': 200,
}

Now you should be able to see something like this in your spider stats:

'start_requests/item_scraped_count/START_URL1': ITEMCOUNT1,
'start_requests/item_scraped_count/START_URL2': ITEMCOUNT2,
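
If you want the per-URL counts as their own mapping (for example, to store them next to the items), you could filter the stats dict by the key prefix the middleware uses. This is a rough sketch: `per_start_url_counts` is a hypothetical helper and `example_stats` is made-up data. In a real crawl the dict would presumably come from something like `crawler.stats.get_stats()`.

```python
PREFIX = "start_requests/item_scraped_count/"


def per_start_url_counts(stats):
    """Return {start_url: item_count} for every stats key written by the middleware."""
    return {
        key[len(PREFIX):]: value
        for key, value in stats.items()
        if key.startswith(PREFIX)
    }


# Made-up stats dict shaped like the output shown above.
example_stats = {
    "item_scraped_count": 19,
    PREFIX + "https://example.com/a": 12,
    PREFIX + "https://example.com/b": 7,
}

print(per_start_url_counts(example_stats))
# {'https://example.com/a': 12, 'https://example.com/b': 7}
```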
