Scrapy - Silently drop an item

Question

I am using Scrapy to crawl several websites, which may share redundant information.

For each page I scrape, I store the page's URL, title, and HTML code in MongoDB. To avoid duplicates in the database, I implemented a pipeline that checks whether a similar item is already stored. If it is, I raise a DropItem exception.

My problem is that whenever I drop an item by raising a DropItem exception, Scrapy writes the entire content of the item to the log (stdout or file). Since I extract the full HTML code of each scraped page, every drop dumps the whole HTML into the log.

How can I silently drop an item without its content being shown?

Thank you for your time!

from scrapy import log
from scrapy.exceptions import DropItem


class DatabaseStorage(object):
    """ Pipeline in charge of database storage.

    The 'whole' item (with HTML and text) will be stored in mongoDB.
    """

    def __init__(self):
        # MongoConnector is the asker's own helper; it is not shown here.
        self.mongo = MongoConnector().collection

    def process_item(self, item, spider):
        """ Method in charge of item validation and processing. """
        # .get() avoids a KeyError when a field was never populated
        if item.get('html') and item.get('title') and item.get('url'):
            # insert item in mongo if not already present
            if self.mongo.find_one({'title': item['title']}):
                raise DropItem('Item already in db')
            else:
                self.mongo.insert(dict(item))
                log.msg("Item %s scraped" % item['title'],
                    level=log.INFO, spider=spider)
        else:
            # fall back to the title when the url is missing
            raise DropItem('Missing information on item scraped from %s' % (
                item.get('url') or item.get('title')))
        return item

Recommended answer

The proper way to do this seems to be to implement a custom LogFormatter for your project and change the logging level of dropped items.

Example:

from scrapy import log
from scrapy import logformatter

class PoliteLogFormatter(logformatter.LogFormatter):
    def dropped(self, item, exception, response, spider):
        return {
            'level': log.DEBUG,
            'format': logformatter.DROPPEDFMT,
            'exception': exception,
            'item': item,
        }

Then in your settings file, something like:

LOG_FORMATTER = 'apps.crawler.spiders.PoliteLogFormatter'
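The example above targets the older scrapy.log API. Newer Scrapy releases log through Python's standard logging module, and formatter methods return a dict with 'level', 'msg' and 'args' keys. Below is a minimal sketch under that assumption, not a definitive implementation. QuietLogFormatter, ListHandler and emit_dropped are hypothetical names introduced here, and the ImportError fallback exists only so the sketch can run without Scrapy installed. Unlike the example above, it also leaves the item out of the message, so the HTML is not printed even at DEBUG level.

```python
import logging

try:
    from scrapy.logformatter import LogFormatter
except ImportError:
    # Fallback so this sketch can be run without Scrapy installed.
    LogFormatter = object


class QuietLogFormatter(LogFormatter):
    """Log dropped items at DEBUG level and keep the item body out of the log."""

    def dropped(self, item, exception, response, spider):
        return {
            'level': logging.DEBUG,
            # Only the DropItem reason is formatted; the item is not included.
            'msg': "Dropped: %(exception)s",
            'args': {'exception': exception},
        }


class ListHandler(logging.Handler):
    """Collects formatted messages in a list (for demonstration only)."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(record.getMessage())


def emit_dropped(log_kwargs, logger_level):
    """Return what a logger set to `logger_level` would output for log_kwargs."""
    logger = logging.getLogger("dropped_demo")
    logger.propagate = False
    logger.setLevel(logger_level)
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        logger.log(log_kwargs['level'], log_kwargs['msg'], log_kwargs['args'])
    finally:
        logger.removeHandler(handler)
    return handler.lines


if __name__ == "__main__":
    item = {'url': 'http://example.com', 'title': 'Example', 'html': '<html>...</html>'}
    kwargs = QuietLogFormatter().dropped(item, Exception('Item already in db'), None, None)
    print(emit_dropped(kwargs, logging.INFO))   # DEBUG record is filtered out
    print(emit_dropped(kwargs, logging.DEBUG))  # reason only, no item content
```

The class is registered through LOG_FORMATTER in the same way as above. Since Scrapy's default LOG_LEVEL is DEBUG, setting LOG_LEVEL = 'INFO' in the settings file is what actually hides the DEBUG-level drop messages.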

I had no luck simply returning None; that caused exceptions in later pipelines.
