Scrapy: Program organization when interacting with a secondary website

Problem Description

I'm working with Scrapy 1.1 and I have a project where spider '1' scrapes site A (where I acquire 90% of the information to fill my items). However, depending on the results of the site A scrape, I may need to scrape additional information from site B. In terms of organizing the program, does it make more sense to scrape site B within spider '1', or would it be possible to interact with site B from within a pipeline object? I prefer the latter, thinking that it decouples the scraping of the two sites, but I'm not sure whether this is possible or whether it is the best way to handle this use case. Another approach might be to use a second spider (spider '2') for site B, but then I assume I would have to let spider '1' run, save to the db, and then run spider '2'. Any advice would be appreciated.

Recommended Answer

Both approaches are very common and this is just a question of preference. For your case, containing everything in one spider sounds like a straightforward solution.
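For the single-spider route, a minimal sketch looks like the following (the spider name, URLs and CSS selectors are hypothetical placeholders, only to illustrate the request-chaining pattern); the callback that parses site A passes the partly filled item along to a second callback for site B via the request's meta:

import scrapy


class SiteASpider(scrapy.Spider):
    # spider name, start URL and selectors are made up for illustration
    name = 'site_a'
    start_urls = ['http://site-a.example/products']

    def parse(self, response):
        # fill most of the item from site A
        item = {'title': response.css('h1::text').extract_first()}
        site_b_url = response.css('a.details::attr(href)').extract_first()
        if site_b_url:
            # hand the partly filled item to the next callback via meta
            yield scrapy.Request(response.urljoin(site_b_url),
                                 callback=self.parse_site_b,
                                 meta={'item': item})
        else:
            yield item

    def parse_site_b(self, response):
        # complete the item with data from site B
        item = response.meta['item']
        item['some_extra_stuff'] = response.css('p.extra::text').extract_first()
        yield item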

Alternatively, you can add a url field to your item and schedule and parse it later in a pipeline:

from scrapy import Request
from scrapy.exceptions import DropItem


class MyPipeline(object):
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_item(self, item, spider):
        extra_url = item.get('extra_url', None)
        if not extra_url:
            # nothing extra to fetch, let the item pass through
            return item
        req = Request(url=extra_url,
                      callback=self.custom_callback,
                      meta={'item': item})
        # schedule the extra request on the running engine
        self.crawler.engine.crawl(req, spider)
        # you have to drop the item here since you will return it later anyway
        raise DropItem('scheduled an extra request to complete this item')

    def custom_callback(self, response):
        # retrieve your item
        item = response.meta['item']
        # do something to add to item
        item['some_extra_stuff'] = ...
        del item['extra_url']
        yield item

What the above code does is check whether the item has an extra_url field; if it does, it drops the item and schedules a new request. That request fills the item with the extra data and sends it back through the pipeline.
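Note that for this pipeline to run at all, it has to be enabled in the project settings. A minimal example, where the module path 'myproject.pipelines.MyPipeline' is an assumption about your project layout:

# settings.py
# 'myproject.pipelines.MyPipeline' is an assumed module path; point it
# to wherever MyPipeline actually lives in your project
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}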
