Missing data with CSV export for multiple spiders / Scrapy / Pipeline
Problem description
I implemented a pipeline based on some examples from around here. I'm trying to export all the information from multiple spiders (launched from a single script rather than from the command line) into a single CSV file.
However, some of the data (around 10%) shown in the shell doesn't get recorded in the CSV. Is this because the spiders are writing at the same time?
How can I fix my script so that all the data is collected in a single CSV? I'm using CrawlerProcess to launch the spiders.
from scrapy import signals
# scrapy.contrib.exporter was removed in later Scrapy releases;
# CsvItemExporter now lives in scrapy.exporters.
from scrapy.exporters import CsvItemExporter


class ScrapybotPipeline(object):

    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        file = open('result_extract.csv', 'w+b')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.fields_to_export = ['ean', 'price', 'desc', 'company']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
Recommended answer
I gather from your description that you're handling multiple spiders. Just to confirm: are you running them at the same time (within the same crawl process)?
Judging from the code you shared, you're trying to maintain one output file object per spider, but they all write to the same path. In spider_opened:
file = open('result_extract.csv', 'w+b')
self.files[spider] = file
This is most likely the root cause of the issue: every spider that starts reopens result_extract.csv in 'w+b' mode, which truncates whatever the other spiders have already written to it.
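The loss is easy to reproduce with plain file I/O, no Scrapy needed. The row values below are made up for illustration, but the truncation behaviour of mode 'w' (and of 'w+b') is exactly what each spider_opened call triggers:

```python
# Two writers opening the same path in 'w' mode, like two spiders
# each running spider_opened against the same CSV file.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "result_extract.csv")

# "Spider A" opens the file and writes a row.
f_a = open(path, "w")
f_a.write("ean_a,1.0\n")
f_a.flush()

# "Spider B" opens the same path: mode 'w' truncates the file,
# so spider A's row is discarded.
f_b = open(path, "w")
f_b.write("ean_b,2.0\n")

f_a.close()
f_b.close()

with open(path) as f:
    content = f.read()

print(content)  # only spider B's row survives
```

The same thing happens in the pipeline whenever a second spider opens while the first is still (or already done) writing.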
Since there is only one file on your filesystem to write to, you can simply open it once. A modified version of your code:
class ScrapybotPipeline(object):

    def __init__(self):
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        self.file = open('result_extract.csv', 'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.fields_to_export = ['ean', 'price', 'desc', 'company']
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
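One thing worth double-checking alongside the fix: the pipeline only runs if it is enabled. A sketch of the setting, assuming the pipeline lives in a module called scrapybot.pipelines (the dotted path is an assumption; adjust it to your project layout):

```python
# settings.py (or the settings dict handed to CrawlerProcess).
# The dotted path below is a guess based on the class name above;
# replace it with the actual location of ScrapybotPipeline.
ITEM_PIPELINES = {
    "scrapybot.pipelines.ScrapybotPipeline": 300,
}
```

The integer (0-1000) orders this pipeline relative to any others; 300 is an arbitrary middle value.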