Correct way to nest Item data in Scrapy
Question
What is the correct way to nest Item data?
For example, I want the output for a product to look like this:
{
    'price': price,
    'title': title,
    'meta': {
        'url': url,
        'added_on': added_on
    }
}
I have the following scrapy.Item:
import scrapy
# In older Scrapy releases this lived in scrapy.loader.processors
from itemloaders.processors import TakeFirst

class ProductItem(scrapy.Item):
    # Each field is declared once; the duplicate `url` declaration was dropped
    url = scrapy.Field(output_processor=TakeFirst())
    price = scrapy.Field(output_processor=TakeFirst())
    title = scrapy.Field(output_processor=TakeFirst())
    added_on = scrapy.Field(output_processor=TakeFirst())
Right now, I simply reformat the whole item in the pipeline according to a new item template:
class FormatedItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    meta = scrapy.Field()
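And in the pipeline:
class FormatPipeline:  # hypothetical class name; the original shows only the method
    def process_item(self, item, spider):
        # Copy the flat fields and group the rest under 'meta'
        formated_item = FormatedItem()
        formated_item['title'] = item['title']
        formated_item['price'] = item['price']
        formated_item['meta'] = {
            'url': item['url'],
            'added_on': item['added_on']
        }
        return formated_item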
Is this the correct way to approach the problem, or is there a more straightforward way that does not break the philosophy of the framework?
Recommended answer
UPDATE from comments: Nested loaders appear to be the more up-to-date approach. Another comment suggests that the approach in this answer will cause errors during serialization.
The best way to approach this is to create a main and a meta item class/loader.
from scrapy.item import Item, Field
# The original answer imported these from scrapy.contrib.loader,
# a path that recent Scrapy releases no longer provide.
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst

class MetaItem(Item):
    url = Field()
    added_on = Field()

class MainItem(Item):
    price = Field()
    title = Field()
    meta = Field(serializer=MetaItem)  # holds the nested MetaItem

class MainItemLoader(ItemLoader):
    default_item_class = MainItem
    default_output_processor = TakeFirst()

class MetaItemLoader(ItemLoader):
    default_item_class = MetaItem
    default_output_processor = TakeFirst()
Example usage:
from scrapy import Spider  # the original used the older scrapy.spider path
from qwerty.items import MainItemLoader, MetaItemLoader
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["example.com"]
    start_urls = ["http://example.com"]

    def parse(self, response):
        mainloader = MainItemLoader(selector=Selector(response))
        mainloader.add_value('title', 'test')
        mainloader.add_value('price', 'price')
        # The meta sub-item is built by its own loader and stored as one value
        mainloader.add_value('meta', self.get_meta(response))
        return mainloader.load_item()

    def get_meta(self, response):
        metaloader = MetaItemLoader(selector=Selector(response))
        metaloader.add_value('url', response.url)
        metaloader.add_value('added_on', 'now')
        return metaloader.load_item()
After that, you can easily extend your items in the future by creating more "sub-items."
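As one illustration of the "sub-items" idea, here is a minimal sketch that uses plain Python dataclasses, which Scrapy 2.2 and later accept as item types. This is not the answer's loader-based method: the class names `Meta`, `Stock`, and `Product`, the extra `stock` sub-item, and the sample values are all hypothetical, and `dataclasses.asdict` is used only to show the nested shape that results.

```python
from dataclasses import dataclass, asdict, field


@dataclass
class Meta:
    # Mirrors the 'meta' sub-item from the question
    url: str
    added_on: str


@dataclass
class Stock:
    # Hypothetical second sub-item, added to show how the item can grow
    available: bool = True


@dataclass
class Product:
    title: str
    price: str
    meta: Meta
    stock: Stock = field(default_factory=Stock)


product = Product(
    title="test",
    price="9.99",
    meta=Meta(url="http://example.com", added_on="now"),
)

# asdict() recurses into nested dataclasses, producing nested dicts
nested = asdict(product)
```

Adding another sub-item then only means declaring one more dataclass and one more field on `Product`; the top-level fields stay flat while each group of related values lives under its own key.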