Scrapy custom pipeline outputting files half the size expected


Question

I'm trying to create a custom pipeline for a Scrapy project that outputs the collected items to CSV files. In order to keep each file's size down I want to set a maximum number of rows that each file can have. Once the line limit has been reached in the current file a new file is created to continue outputting the items.

Luckily, I found a question where someone was looking to do the same thing, and an answer to that question shows an example implementation.

I implemented that example, but tweaked the way the stats are accessed to align with the current version of Scrapy.

from scrapy.exporters import CsvItemExporter
import datetime

class PartitionedCsvPipeline(object):

    def __init__(self, stats):
        self.stats = stats
        self.stats.set_value('item_scraped_count', 0)
        self.base_filename = "site_{}.csv"
        self.next_split = self.split_limit = 100
        self.create_exporter()  

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def create_exporter(self):
        now = datetime.datetime.now()
        datetime_stamp = now.strftime("%Y%m%d%H%M")
        self.file = open(self.base_filename.format(datetime_stamp),'w+b')
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()       

    def process_item(self, item, spider):
        if self.stats.get_value('item_scraped_count') >= self.next_split:
            self.next_split += self.split_limit
            self.exporter.finish_exporting()
            self.file.close()
            self.create_exporter()
        self.exporter.export_item(item)
        self.stats.inc_value('item_scraped_count')
        return item

The Problem

The pipeline does produce multiple files, but each file contains only 50 items instead of the expected 100.

What am I doing wrong that makes the files half the expected size?

Answer

In process_item() I added

 print('>>> stat count:', self.stats.get_value('item_scraped_count'))

and then removed

 self.stats.inc_value('item_scraped_count')

and I could see that the stat still increases.

This means that Scrapy already counts the scraped items elsewhere, so you shouldn't increment this stat yourself.
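
For reference, this built-in counting is done by Scrapy's CoreStats extension. A simplified sketch of the relevant part (the real code lives in scrapy/extensions/corestats.py and may differ between Scrapy versions):

class CoreStats(object):

    def __init__(self, stats):
        self.stats = stats

    def item_scraped(self, item, spider):
        # Scrapy connects this handler to the item_scraped signal, so the
        # stat is already incremented once per scraped item. Incrementing
        # it again in the pipeline makes it grow twice as fast.
        self.stats.inc_value('item_scraped_count', spider=spider)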

If I keep inc_value(), then every item is counted twice.

I'm not sure whether it counts only the items that are actually added to the CSV, so it is safer to use a separate variable to do the counting:

class PartitionedCsvPipeline(object):

    def __init__(self, stats):
        self.count = 0

        # ... code ...

    def process_item(self, item, spider):
        print('>>> count:', self.count)

        if self.count >= self.next_split:
            # ... code ...

        # ... code ...

        self.count += 1

        return item


The pipeline also needs this method to close the last file and save all remaining data to it:

def close_spider(self, spider):
    self.exporter.finish_exporting()
    self.file.close()


Minimal working example.

I put everything in one file so it can be run with python script.py without creating a project. This way everyone can easily test it.

Because each file holds only 10 items, new files are created so quickly that I had to add microseconds (%f) to the filename to keep the names unique.

import scrapy
from scrapy.exporters import CsvItemExporter
import datetime

class MySpider(scrapy.Spider):

    name = 'myspider'

    # see page created for scraping: http://toscrape.com/
    start_urls = ['http://books.toscrape.com/'] #'http://quotes.toscrape.com']

    def parse(self, response):
        print('url:', response.url)

        # collect image URLs from the page and yield one item per URL
        for url in response.css('img::attr(src)').extract():
            url = response.urljoin(url)
            yield {'image_urls': [url], 'session_path': 'hello_world'}


class PartitionedCsvPipeline(object):

    def __init__(self, stats):
        self.filename = "site_{}.csv"
        self.split_limit = 10
        
        self.count = 0
        self.create_exporter()  

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.stats)

    def create_exporter(self):
        now = datetime.datetime.now()
        datetime_stamp = now.strftime("%Y.%m.%d-%H.%M.%S.%f")  # %f adds microseconds, because a new file can be created in less than a second and would otherwise get the same name
        
        self.file = open(self.filename.format(datetime_stamp), 'w+b')
        
        self.exporter = CsvItemExporter(self.file)
        self.exporter.start_exporting()       

    def finish_exporter(self):
        self.exporter.finish_exporting()
        self.file.close()
    
    def process_item(self, item, spider):

        if self.count >= self.split_limit:
            self.finish_exporter()
            self.count = 0
            self.create_exporter()

        self.exporter.export_item(item)
        self.count += 1
        print('self.count:', self.count)

        return item
        
    def close_spider(self, spider):
        self.finish_exporter()
        
# --- run without a project; output goes to the site_*.csv files ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',

    'ITEM_PIPELINES': {'__main__.PartitionedCsvPipeline': 1},   # use the pipeline defined in this file (hence the __main__ prefix)
})

c.crawl(MySpider)
c.start() 
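
To quickly check the result, here is a small verification snippet (not part of the original answer) that counts the data rows in each generated CSV, assuming the site_*.csv filename pattern used above:

import csv
import glob

# Count the data rows in every generated CSV (header excluded) to confirm
# that no file holds more than split_limit items.
for path in sorted(glob.glob('site_*.csv')):
    with open(path, newline='') as f:
        rows = sum(1 for _ in csv.reader(f)) - 1
    print(path, rows)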
