Scrapy: how to use items in spider and how to send items to pipelines?


Problem description

I am new to scrapy and my task is simple:

For a given e-commerce website:

  • crawl all website pages
  • look for product pages
  • if the URL points to a product page, create an Item
  • process the Item to store it in a database

I created the spider, but products are just printed in a simple file.

My question is about the project structure: how to use items in the spider and how to send items to pipelines?

I can't find a simple example of a project using items and pipelines.

Answer

  • How to use items in spider?

    Well, the main purpose of items is to store the data you crawled. scrapy.Items are basically dictionaries. To declare your items, you have to create a class and add a scrapy.Field for each field:

    import scrapy
    
    class Product(scrapy.Item):
        url = scrapy.Field()
        title = scrapy.Field()
    

    You can now use it in your spider by importing your Product.

    For more advanced information, check the documentation here.
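    The "basically dictionaries" point can be seen without scrapy at all. The sketch below is a hypothetical stand-in class, not the real scrapy.Item, but it mimics the one behavior that matters: dict-style access that is restricted to the declared fields.

    ```python
    class Product(dict):
        # Mirrors the scrapy.Field declarations above: only these keys are allowed
        fields = ('url', 'title')

        def __setitem__(self, key, value):
            if key not in self.fields:
                raise KeyError("Product does not support field: %r" % key)
            dict.__setitem__(self, key, value)

    item = Product()
    item['url'] = 'http://www.example.com'
    item['title'] = 'A product'
    ```

    A real scrapy.Item raises KeyError the same way when you assign to a field that was not declared.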

  • How to send items to pipelines?

    First, you need to tell your spider to use your custom pipeline.

    In the settings.py file:

    ITEM_PIPELINES = {
        'myproject.pipelines.CustomPipeline': 300,
    }
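
    The integer is the pipeline's order: Scrapy accepts values in the 0–1000 range and runs pipelines with lower numbers first, so several pipelines can be chained. For example (the class names here are hypothetical):

    ```python
    ITEM_PIPELINES = {
        'myproject.pipelines.ValidationPipeline': 100,  # runs first
        'myproject.pipelines.DatabasePipeline': 300,    # runs on whatever survives validation
    }
    ```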
    

    You can now write your pipeline and play with your item.

    In the pipelines.py file:

    from scrapy.exceptions import DropItem
    
    class CustomPipeline(object):
        def __init__(self):
            # Create your database connection here
            pass
    
        def process_item(self, item, spider):
            # Here you can index your item, or raise DropItem to discard it
            return item

    Finally, in your spider, you need to yield your item once it is filled.

    Example spider.py:

    import scrapy
    from myspider.items import Product
    
    class MySpider(scrapy.Spider):
        name = "test"
        start_urls = [
            'http://www.exemple.com',
        ]
    
        def parse(self, response):
            doc = Product()
            doc['url'] = response.url
            doc['title'] = response.xpath('//div/p/text()').get()
            yield doc  # Will go to your pipeline
    

    Hope this helps, here is the doc for pipelines: Item Pipeline
