Scrapy Data Flow and Items and Item Loaders


Question

I am looking at the Architecture Overview page in the Scrapy documentation, but I still have a few questions regarding data and/or control flow.

Scrapy Architecture

Default File Structure of Scrapy Projects

scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
    ...

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class MyprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

which, I'm assuming, becomes

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

so that errors are thrown when trying to populate undeclared fields of Product instances

>>> product = Product(name='Desktop PC', price=1000)
>>> product['lala'] = 'test'
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'

Question #1

Where, when, and how does our crawler become aware of items.py if we have created class CrowdfundingItem in items.py?

Is this done in...

  • __init__.py?
  • my_crawler.py?
  • def __init__() of mycrawler.py?
  • settings.py?
  • pipelines.py?
  • def __init__(self, dbpool) of pipelines.py?
  • somewhere else?

Question #2

Once I have declared an item such as Product, how do I then store the data by creating instances of Product in a context similar to the one below?

import scrapy
from scrapy.spiders import CrawlSpider

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

class MycrawlerSpider(CrawlSpider):
    name = 'mycrawler'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/']
    def parse(self, response):
        options = Options()
        options.add_argument('-headless')
        browser = webdriver.Firefox(firefox_options=options)
        browser.get(self.start_urls[0])
        elements = browser.find_elements_by_xpath('//section')
        count = 0
        for ele in elements:
             name = ele.find_element_by_xpath('./div[@id="name"]').text
             price = ele.find_element_by_xpath('./div[@id="price"]').text

             # If I am not sure how many items there will be,
             # and hence I cannot declare them explicitly,
             # how I would go about creating named instances of Product?

             # Obviously the code below will not work, but how can you accomplish this?

             count += 1
             varName + count = Product(name=name, price=price)
             ...

Lastly, say we forego naming the Product instances altogether, and instead simply create unnamed instances.

for ele in elements:
    name = ele.find_element_by_xpath('./div[@id="name"]').text
    price = ele.find_element_by_xpath('./div[@id="price"]').text
    Product(name=name, price=price)

If such instances are indeed stored somewhere, where are they stored? By creating instances this way, would it be impossible to access them?

Solution

Using an Item is optional; it's just a convenient way to declare your data model and apply validation. You can also use a plain dict instead.

If you do choose to use an Item, you will need to import it into the spider; it is not discovered automatically. In your case:

from items import CrowdfundingItem
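
Note that the exact import path depends on how the spider is run; with the myproject/items.py layout shown above, the package-qualified form is usually the one that resolves:

from myproject.items import CrowdfundingItem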

As the spider runs its parse method on each page, you can load the extracted data into your Item or dict. Once it's loaded, yield it, which passes it back to the Scrapy engine for downstream processing in pipelines or exporters. This is how Scrapy "stores" the data you scrape.

For example:

yield Product(name='Desktop PC', price=1000)   # uses the Item
yield {'name': 'Desktop PC', 'price': 1000}    # plain dict
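
To make Question #2 concrete: you don't need a distinct variable name for each item. Build one Product inside the loop and yield it immediately; the engine picks up every yielded item as it is produced. Below is a minimal sketch that assumes the project layout shown above (so Product is declared in myproject/items.py) and the same hypothetical div[@id="name"] / div[@id="price"] markup from the question; it uses Scrapy's own response.xpath() selectors rather than Selenium to stay self-contained, and the URL is a placeholder.

import scrapy

from myproject.items import Product  # assumes Product is declared in myproject/items.py


class MycrawlerSpider(scrapy.Spider):
    # A plain Spider is enough here; CrawlSpider is only needed when you
    # define link-following Rules, and it reserves parse() for itself.
    name = 'mycrawler'
    start_urls = ['https://www.example.com/']  # placeholder URL

    def parse(self, response):
        # One Product per <section>; no per-item variable names are needed.
        for section in response.xpath('//section'):
            yield Product(
                name=section.xpath('.//div[@id="name"]/text()').get(),
                price=section.xpath('.//div[@id="price"]/text()').get(),
            )

Because every yielded item goes straight to the engine, nothing is lost by not keeping your own references to them; if you want to inspect the results afterwards, run the crawl with a feed export, e.g. scrapy crawl mycrawler -o products.json, or handle them in an item pipeline.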
