Items vs item loaders in scrapy


Question

I'm pretty new to scrapy. I know that items are used to store scraped data, but I can't understand the difference between items and item loaders. I tried to read some example code; it used item loaders instead of items to store the data, and I can't see why. The Scrapy documentation wasn't clear enough for me. Can anyone give a simple explanation (ideally with an example) of when item loaders should be used and what additional facilities they provide over items?

Answer

I really like the official explanation in the docs:

Item Loaders provide a convenient mechanism for populating scraped Items. Even though Items can be populated using their own dictionary-like API, Item Loaders provide a much more convenient API for populating them from a scraping process, by automating some common tasks like parsing the raw extracted data before assigning it.

In other words, Items provide the container of scraped data, while Item Loaders provide the mechanism for populating that container.
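That container/mechanism split can be pictured with a dependency-free sketch. Here a plain dict plays the "container" (Item) role and a small helper plays the "mechanism" (loader) role; these are stand-ins for illustration, not the real scrapy classes:

```python
# Stand-ins: a plain dict is the "container" (Item role), and a small
# helper function is the "mechanism" (Item Loader role) that cleans
# raw extracted values while populating it.

def load_item(raw):
    """Mimic a loader: apply cleanup/conversion while filling the container."""
    item = {}                                      # the container
    item['full_name'] = raw['full_name'].strip()   # cleanup on assignment
    item['age'] = int(raw['age'])                  # type conversion
    return item

item = load_item({'full_name': '  John Snow\n', 'age': '27'})
print(item)  # {'full_name': 'John Snow', 'age': 27}
```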

The last paragraph should answer your question.
Item loaders are great since they give you many processing shortcuts and let you reuse a lot of code, keeping everything tidy, clean and understandable.

As a comparison, let's say we want to scrape this item:

from scrapy.item import Item, Field

class MyItem(Item):
    full_name = Field()
    bio = Field()
    age = Field()
    weight = Field()
    height = Field()

The Item-only approach would look something like this:

def parse(self, response):
    item = MyItem()
    full_name = response.xpath("//div[contains(@class,'name')]/text()").extract()
    # i.e. returns ugly ['John\n', '\n\t  ', '  Snow']
    item['full_name'] = ' '.join(i.strip() for i in full_name if i.strip())
    bio = response.xpath("//div[contains(@class,'bio')]/text()").extract()
    item['bio'] = ' '.join(i.strip() for i in bio if i.strip())
    age = response.xpath("//div[@class='age']/text()").extract_first(0)
    item['age'] = int(age)
    weight = response.xpath("//div[@class='weight']/text()").extract_first(0)
    item['weight'] = int(weight)
    height = response.xpath("//div[@class='height']/text()").extract_first(0)
    item['height'] = int(height)
    return item

Versus the Item Loaders approach:

# define once in items.py
from scrapy.loader import ItemLoader
# in newer Scrapy versions these processors live in itemloaders.processors
from scrapy.loader.processors import Compose, MapCompose, Join, TakeFirst

clean_text = Compose(MapCompose(lambda v: v.strip()), Join())
to_int = Compose(TakeFirst(), int)

class MyItemLoader(ItemLoader):
    default_item_class = MyItem
    full_name_out = clean_text
    bio_out = clean_text
    age_out = to_int
    weight_out = to_int
    height_out = to_int

# parse as many different places and times as you want  
def parse(self, response):
    loader = MyItemLoader(selector=response)
    loader.add_xpath('full_name', "//div[contains(@class,'name')]/text()")
    loader.add_xpath('bio', "//div[contains(@class,'bio')]/text()")
    loader.add_xpath('age', "//div[@class='age']/text()")
    loader.add_xpath('weight', "//div[@class='weight']/text()")
    loader.add_xpath('height', "//div[@class='height']/text()")
    return loader.load_item()
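To see what those output processors actually do, here is a dependency-free sketch that mimics `clean_text` and `to_int` in plain Python. These are stand-in functions, not the real scrapy processor classes (for one thing, this sketch drops empty strings, which the real `MapCompose`/`Join` pipeline would keep):

```python
# Plain-Python stand-ins mimicking the loader's output processors.
# Not the real scrapy classes; this sketch drops empty strings,
# which the real MapCompose/Join pipeline would keep.

def clean_text(values):
    # MapCompose(lambda v: v.strip()) -> strip each extracted string
    # Join() -> join the surviving pieces with a single space
    return ' '.join(v.strip() for v in values if v.strip())

def to_int(values):
    # TakeFirst() -> first non-empty value; int -> cast it
    return int(next(v for v in values if v not in (None, '')))

print(clean_text(['John\n', '\n\t  ', '  Snow']))  # John Snow
print(to_int(['27']))  # 27
```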

As you can see, the Item Loader version is much cleaner and easier to scale. Say you have 20 more fields, many of which share the same processing logic; doing that without Item Loaders would be painful. Item Loaders are awesome and you should use them!
