Missing data with Export CSV for multiple spiders / Scrapy / Pipeline

Problem Description

I implemented a pipeline based on some examples around here. I'm trying to export all the information from multiple spiders (launched from a single script rather than from the command line) into a single CSV file.

However, it appears that some of the data (around 10%) shown in the shell is not recorded in the CSV. Is this because the spiders are writing at the same time?

How can I fix my script so that all the data is collected in a single CSV? I'm using CrawlerProcess to launch the spiders.
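For reference, the spiders are launched from a single script roughly like this (a minimal sketch; the myproject paths and SpiderA / SpiderB are placeholders, not my real names):

# launch_spiders.py - minimal sketch of the single-file launcher
# ("myproject" paths and SpiderA / SpiderB are placeholders for the real project)
from scrapy.crawler import CrawlerProcess

from myproject.spiders.spider_a import SpiderA
from myproject.spiders.spider_b import SpiderB

process = CrawlerProcess(settings={
    'ITEM_PIPELINES': {'myproject.pipelines.ScrapybotPipeline': 300},
})
process.crawl(SpiderA)
process.crawl(SpiderB)
process.start()  # blocks until every crawl has finished

The pipeline itself looks like this: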

from scrapy import signals
from scrapy.contrib.exporter import CsvItemExporter


class ScrapybotPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('result_extract.csv', 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['ean', 'price', 'desc', 'company']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

Answer

I gather from your description that you're handling multiple spiders. Just to confirm: are you running them at the same time (within the same crawl process)?

Judging from the code you shared, you're trying to maintain one output file object per spider, but they all write to the same path. In spider_opened:

file = open('result_extract.csv', 'w+b')
self.files[spider] = file

This is most likely the root cause of the issue: every open('result_extract.csv', 'w+b') call truncates the file, and the separate file handles then write to that same path through independent buffers, so some of the exported rows end up overwriting each other or being dropped.
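To see why, here is a minimal illustration in plain Python (outside Scrapy) of two handles opening the same path the way your pipeline does:

# Two buffered handles on one path, as happens when both spiders open the file
f1 = open('result_extract.csv', 'w+b')   # first spider opens (and truncates) the file
f1.write(b'ean1,price1,desc1,company1\n')
f2 = open('result_extract.csv', 'w+b')   # second spider truncates the same file again
f2.write(b'ean2,price2,desc2,company2\n')
f1.close()
f2.close()
# Both handles buffer their writes and flush from offset 0, so whichever flushes
# last overwrites the other one's row: data silently disappears from the CSV.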

Since there is only one file (on your filesystem) to write to, you can simply open it once. Here is a modified version of your code:

class ScrapybotPipeline(object):

    def __init__(self):
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        self.file = open('result_extract.csv', 'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.fields_to_export = ['ean', 'price', 'desc', 'company']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
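Note that the pipeline still needs the same imports as your original snippet. On current Scrapy releases the exporter no longer lives in scrapy.contrib, so they would be:

from scrapy import signals
from scrapy.exporters import CsvItemExporter  # scrapy.contrib.exporter is deprecated/removed in newer Scrapy releases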
