Scrapy, Python: Multiple Item Classes in one pipeline?

Question

I have a Spider that scrapes data which cannot be saved in one item class.

For illustration, I have one Profile Item, and each Profile Item might have an unknown number of Comments. That is why I want to implement Profile Item and Comment Item. I know I can pass them to my pipeline simply by using yield.
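
For context, here is a minimal sketch of what the two item classes might look like; the field names are assumptions for illustration only, not taken from the question:

import scrapy

class ProfileItem(scrapy.Item):
    # hypothetical fields, for illustration only
    name = scrapy.Field()
    url = scrapy.Field()

class CommentItem(scrapy.Item):
    # hypothetical fields, for illustration only
    profile_url = scrapy.Field()
    text = scrapy.Field()

Both kinds of items can then be yielded from the same spider callback, one ProfileItem per profile and one CommentItem per comment.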

However, I do not know how a pipeline with one parse_item function can handle two different item classes.

Or is it possible to use different parse_item functions?

Or do I have to use several pipelines?

Or is it possible to write an Iterator to a Scrapy Item Field?


comments_list = []
comments = response.xpath(somexpath)
for x in comments.extract():
    comments_list.append(x)
ScrapyItem['comments'] = comments_list
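
As a side note, .extract() already returns a list of strings, so the loop above could be collapsed into a single assignment (assuming the field should simply hold the raw extracted strings):

ScrapyItem['comments'] = response.xpath(somexpath).extract()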

Answer

By default every item goes through every pipeline.

For instance, if you yield a ProfileItem and a CommentItem, they'll both go through all pipelines. If you have a pipeline set up to track item types, your process_item method could look like this:

def process_item(self, item, spider):
    self.stats.inc_value('typecount/%s' % type(item).__name__)
    return item
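
The snippet above assumes the pipeline has access to a stats collector via self.stats. Below is a minimal sketch of how that might be wired up using Scrapy's standard from_crawler hook; the pipeline class name is hypothetical:

class ItemTypeStatsPipeline(object):
    """Counts how many items of each class pass through (hypothetical name)."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this when building the pipeline; keep the stats collector.
        return cls(crawler.stats)

    def process_item(self, item, spider):
        self.stats.inc_value('typecount/%s' % type(item).__name__)
        return item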

When a ProfileItem comes through, 'typecount/ProfileItem' is incremented. When a CommentItem comes through, 'typecount/CommentItem' is incremented.

You can also have one pipeline handle only a single item type, if that type needs its own processing, by checking the item type before proceeding:

def process_item(self, item, spider):
    if not isinstance(item, ProfileItem):
        return item
    # Handle your Profile Item here.

If you set up the two process_item methods above in different pipelines, an item will go through both of them: it is tracked by the first and processed (or ignored) by the second.
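
For reference, a minimal sketch of how two such pipelines might be enabled together in settings.py; the module path and class names are hypothetical, and the numbers only control ordering:

ITEM_PIPELINES = {
    'myproject.pipelines.ItemTypeStatsPipeline': 100,  # counts item types
    'myproject.pipelines.ProfilePipeline': 300,        # handles only ProfileItem
}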

Additionally, you could have one pipeline set up to handle all 'related' items:

def process_item(self, item, spider):
    if isinstance(item, ProfileItem):
        return self.handleProfile(item, spider)
    if isinstance(item, CommentItem):
        return self.handleComment(item, spider)
    return item  # pass anything else through untouched

def handleComment(self, item, spider):
    # Handle the CommentItem here, then return it
    return item

def handleProfile(self, item, spider):
    # Handle the ProfileItem here, then return it
    return item

Or, you could make it even more complex and develop a type delegation system that loads classes and calls default handler methods, similar to how Scrapy handles middleware/pipelines. It's really up to you how complex you need it to be and what you want to do.
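
As one possible design (not a built-in Scrapy facility), here is a minimal sketch of such a delegation using a simple class-to-handler mapping; ProfileItem and CommentItem are the item classes from above:

class DelegatingPipeline(object):
    """Dispatches each item to a handler chosen by its class (illustrative)."""

    def __init__(self):
        # Map item classes to bound handler methods; unknown types fall through untouched.
        self.handlers = {
            ProfileItem: self.handleProfile,
            CommentItem: self.handleComment,
        }

    def process_item(self, item, spider):
        handler = self.handlers.get(type(item))
        return handler(item, spider) if handler else item

    def handleProfile(self, item, spider):
        # Profile-specific processing would go here.
        return item

    def handleComment(self, item, spider):
        # Comment-specific processing would go here.
        return item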
