How to create a custom Scrapy Item Exporter?


Question

I'm trying to create a custom Scrapy Item Exporter based off JsonLinesItemExporter so I can slightly alter the structure it produces.

I have read the documentation here http://doc.scrapy.org/en/latest/topics/exporters.html but it doesn't state how to create a custom exporter, where to store it or how to link it to your Pipeline.

I have identified how to go custom with the Feed Exporters but this is not going to suit my requirements, as I want to call this exporter from my Pipeline.

Here is the code I've come up with which has been stored in a file in the root of the project called exporters.py

from scrapy.contrib.exporter import JsonLinesItemExporter

class FanItemExporter(JsonLinesItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        self.file.write("""{
'product': [""")

    def finish_exporting(self):
        self.file.write("]}")

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(',\n')
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict))

I have simply tried calling this from my pipeline by using FanItemExporter and trying variations of the import but it's not resulting in anything.

Recommended answer

It is true that the Scrapy documentation does not clearly state where to place an Item Exporter. To use an Item Exporter, these are the steps to follow.

  1. Choose an Item Exporter class and import it into pipelines.py in the project directory. It can be a pre-defined Item Exporter (e.g. XmlItemExporter) or a user-defined one (like the FanItemExporter defined in the question).
  2. Create an Item Pipeline class in pipelines.py and instantiate the imported Item Exporter in this class. Details are explained later in the answer.
  3. Register this pipeline class in the settings.py file.

Following is a detailed explanation of each step. Solution to the question is included in each step.

Step 1

  • If using a pre-defined Item Exporter class, import it from scrapy.exporters module.
    Ex: from scrapy.exporters import XmlItemExporter

If you need a custom exporter, define a custom class in a file. I suggest placing the class in exporters.py file. Place this file in the project folder (where settings.py, items.py reside).

When creating a new subclass, BaseItemExporter is a reasonable class to extend, and it is the apt choice if we intend to change the functionality entirely. In this question, however, most of the needed functionality already exists in Scrapy's JSON exporters: the question starts from JsonLinesItemExporter, while the comma-separated output it wants matches what JsonItemExporter writes.

Hence, I am attaching two versions of the same Item Exporter. One version extends the BaseItemExporter class and the other extends an existing JSON exporter class.

Version 1: Extending BaseItemExporter

Since BaseItemExporter is the parent class, start_exporting(), finish_exporting() and export_item() must all be overridden to suit our needs.

from scrapy.exporters import BaseItemExporter
from scrapy.utils.serialize import ScrapyJSONEncoder
from scrapy.utils.python import to_bytes

class FanItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        self.file.write(b'{"product": [')  # JSON requires double-quoted keys

    def finish_exporting(self):
        self.file.write(b'\n]}')

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b',\n')
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(to_bytes(self.encoder.encode(itemdict)))

Version 2: Extending JsonItemExporter

JsonItemExporter already provides the same implementation of the export_item() method as Version 1 (items separated by commas). Therefore only the start_exporting() and finish_exporting() methods are overridden. JsonLinesItemExporter, by contrast, writes one item per line without separating commas.

The implementations of JsonItemExporter and JsonLinesItemExporter can be seen in the file python_dir\pkgs\scrapy-1.1.0-py35_0\Lib\site-packages\scrapy\exporters.py

from scrapy.exporters import JsonItemExporter

class FanItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        # Initialize the object using JsonItemExporter's constructor,
        # forwarding any exporter options instead of dropping them
        super().__init__(file, **kwargs)

    def start_exporting(self):
        self.file.write(b'{"product": [')  # JSON requires double-quoted keys

    def finish_exporting(self):
        self.file.write(b'\n]}')

Note: When writing data to a file, keep in mind that the standard Item Exporter classes expect binary files. Hence, the file must be opened in binary mode (b). For the same reason, the write() calls in both versions write bytes to the file.
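The following is a minimal standard-library sketch (no Scrapy involved) that mimics the exporter's write sequence against an in-memory binary buffer. The helper name write_wrapped and the sample items are made up for illustration, and it assumes the wrapper uses a double-quoted "product" key so that the result parses as JSON.

```python
import io
import json


def write_wrapped(buffer, items):
    """Write items as {"product": [...]} using bytes only (hypothetical helper)."""
    buffer.write(b'{"product": [')
    first_item = True
    for item in items:
        if first_item:
            first_item = False
        else:
            buffer.write(b',\n')
        # json.dumps returns str, so it is encoded to bytes before writing
        buffer.write(json.dumps(item).encode('utf-8'))
    buffer.write(b'\n]}')


buffer = io.BytesIO()  # behaves like a file opened with mode 'wb'
write_wrapped(buffer, [{'name': 'fan A'}, {'name': 'fan B'}])
parsed = json.loads(buffer.getvalue().decode('utf-8'))

# Writing str instead of bytes to a binary file object raises TypeError
try:
    io.BytesIO().write('text')
    str_write_failed = False
except TypeError:
    str_write_failed = True
```

Loading the buffer back with json.loads is a quick way to check that the opening and closing wrappers and the comma handling line up.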

Step 2

Create an Item Pipeline class in pipelines.py.

from project_name.exporters import FanItemExporter

class FanExportPipeline(object):
    def __init__(self, file_name):
        # Storing output filename
        self.file_name = file_name
        # Creating a file handle and setting it to None
        self.file_handle = None

    @classmethod
    def from_crawler(cls, crawler):
        # getting the value of FILE_NAME field from settings.py
        output_file_name = crawler.settings.get('FILE_NAME')

        # cls() calls FanExportPipeline's constructor
        # Returning a FanExportPipeline object
        return cls(output_file_name)

    def open_spider(self, spider):
        print('Custom export opened')

        # Opening file in binary-write mode
        file = open(self.file_name, 'wb')
        self.file_handle = file

        # Creating a FanItemExporter object and initiating export
        self.exporter = FanItemExporter(file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        print('Custom Exporter closed')

        # Ending the export to file from the FanItemExporter object
        self.exporter.finish_exporting()

        # Closing the opened output file
        self.file_handle.close()

    def process_item(self, item, spider):
        # Passing the item to the FanItemExporter object for exporting to file
        self.exporter.export_item(item)
        return item
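To make the call order concrete, here is a rough sketch that drives a condensed copy of this pipeline by hand, in the order Scrapy calls the hooks: from_crawler(), open_spider(), process_item() once per item, then close_spider(). StubExporter, ExportPipeline, the fake crawler object and the sample items are all hypothetical stand-ins so that the sketch runs without Scrapy; in a real project Scrapy makes these calls itself.

```python
import json
import os
import tempfile
from types import SimpleNamespace


class StubExporter:
    """Hypothetical stand-in for FanItemExporter (no Scrapy needed)."""

    def __init__(self, file):
        self.file = file
        self.first_item = True

    def start_exporting(self):
        self.file.write(b'{"product": [')

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b',\n')
        self.file.write(json.dumps(item).encode('utf-8'))

    def finish_exporting(self):
        self.file.write(b'\n]}')


class ExportPipeline:
    """Condensed copy of the pipeline, wired to the stub exporter."""

    def __init__(self, file_name):
        self.file_name = file_name

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('FILE_NAME'))

    def open_spider(self, spider):
        self.file_handle = open(self.file_name, 'wb')
        self.exporter = StubExporter(self.file_handle)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file_handle.close()


# Fake crawler: only the settings.get() lookup used by from_crawler is provided
out_path = os.path.join(tempfile.mkdtemp(), 'products.json')
fake_crawler = SimpleNamespace(settings={'FILE_NAME': out_path})

pipeline = ExportPipeline.from_crawler(fake_crawler)
pipeline.open_spider(spider=None)
for item in [{'name': 'fan A'}, {'name': 'fan B'}]:
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, 'rb') as f:
    result = json.loads(f.read().decode('utf-8'))
```

Because finish_exporting() runs before the file is closed, the closing wrapper always ends up in the output, even when no items were processed.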

Step 3

Now that the Item Export Pipeline is defined, register it in the settings.py file. Also add the field FILE_NAME to settings.py. This field contains the filename of the output file.

Add the following lines to settings.py file.

FILE_NAME = 'path/outputfile.ext'
ITEM_PIPELINES = {
    'project_name.pipelines.FanExportPipeline' : 600,
}

If ITEM_PIPELINES is already uncommented, then add the following line to the ITEM_PIPELINES dictionary.

'project_name.pipelines.FanExportPipeline' : 600,
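
For instance, a project that already has another pipeline enabled might end up with settings like the sketch below. The CleanupPipeline entry is hypothetical. The integer values determine the order in which items pass through the pipelines, with lower values running first; they are conventionally kept in the 0-1000 range.

```python
FILE_NAME = 'path/outputfile.ext'

ITEM_PIPELINES = {
    # Hypothetical pre-existing pipeline; runs first because 300 < 600
    'project_name.pipelines.CleanupPipeline': 300,
    'project_name.pipelines.FanExportPipeline': 600,
}

# Not needed in settings.py: just shows the resulting order (ascending value)
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

Giving the export pipeline a higher number than any cleaning or validation pipelines means it receives items only after they have been processed.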

This is one way to create a custom Item Export pipeline.
