Scrapy Data Flow and Items and Item Loaders
I am looking at the Architecture Overview page in the Scrapy documentation, but I still have a few questions regarding data and/or control flow.
[Figure: Scrapy architecture diagram]
Default File Structure of Scrapy Projects
scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class MyprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
which, I'm assuming, becomes
import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
so that an error is thrown when trying to populate undeclared fields of Product instances:
>>> product = Product(name='Desktop PC', price=1000)
>>> product['lala'] = 'test'
Traceback (most recent call last):
...
KeyError: 'Product does not support field: lala'
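To see why this check fires without pulling in Scrapy itself, here is a minimal stand-in (a toy dict subclass written for illustration, not Scrapy's actual implementation) that reproduces the same behavior:

```python
class Item(dict):
    """Toy stand-in for scrapy.Item: rejects undeclared fields."""
    fields = ()

    def __init__(self, **kwargs):
        super().__init__()
        for key, value in kwargs.items():
            self[key] = value  # routes through __setitem__ below

    def __setitem__(self, key, value):
        # Only keys declared in `fields` may be set.
        if key not in self.fields:
            raise KeyError(f'{type(self).__name__} does not support field: {key}')
        super().__setitem__(key, value)


class Product(Item):
    fields = ('name', 'price', 'stock', 'last_updated')


product = Product(name='Desktop PC', price=1000)

try:
    product['lala'] = 'test'
except KeyError as exc:
    print(exc)  # 'Product does not support field: lala'
```

The real scrapy.Item does this through a metaclass that collects scrapy.Field() declarations, but the effect on undeclared keys is the same.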
Question #1
Where, when, and how does our crawler become aware of items.py
if we have created class CrowdfundingItem
in items.py
?
Is this done in...
- __init__.py?
- my_crawler.py?
- def __init__() of my_crawler.py?
- settings.py?
- pipelines.py?
- def __init__(self, dbpool) of pipelines.py?
- somewhere else?
Question #2
Once I have declared an item such as Product
, how do I then store the data by creating instances of Product
in a context similar to the one below?
import scrapy

class MycrawlerSpider(CrawlSpider):
    name = 'mycrawler'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/']

    def parse(self, response):
        options = Options()
        options.add_argument('-headless')
        browser = webdriver.Firefox(firefox_options=options)
        browser.get(self.start_urls[0])
        elements = browser.find_elements_by_xpath('//section')
        count = 0
        for ele in elements:
            name = browser.find_element_by_xpath('./div[@id="name"]').text
            price = browser.find_element_by_xpath('./div[@id="price"]').text
            # If I am not sure how many items there will be,
            # and hence I cannot declare them explicitly,
            # how would I go about creating named instances of Product?
            # Obviously the code below will not work, but how can you accomplish this?
            count += 1
            varName + count = Product(name=name, price=price)
            ...
Lastly, say we forego naming the Product
instances altogether, and instead simply create unnamed instances.
for ele in elements:
    name = browser.find_element_by_xpath('./div[@id="name"]').text
    price = browser.find_element_by_xpath('./div[@id="price"]').text
    Product(name=name, price=price)
If such instances are indeed stored somewhere, where are they stored? By creating instances this way, would it be impossible to access them?
Answer
Using an Item is optional; items are just a convenient way to declare your data model and apply validation. You can use a plain dict instead.
If you do choose to use Item
, you will need to import it for use in the spider. It's not discovered automatically. In your case:
from myproject.items import CrowdfundingItem
As the spider runs its parse method on each page, you can load the extracted data into your Item or dict. Once it's loaded, yield it, which passes it back to the Scrapy engine for downstream processing in pipelines or exporters. This is how Scrapy enables "storage" of the data you scrape.
For example:
yield Product(name='Desktop PC', price=1000) # uses Item
yield {'name':'Desktop PC', 'price':1000} # plain dict
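This also answers Question #2: you never need a distinct variable name per item. Create one instance per loop iteration and yield it immediately; the engine collects each yielded item in turn. A minimal sketch of the pattern (plain dicts, with a hardcoded list standing in for the Selenium extraction):

```python
def parse(sections):
    # `sections` stands in for the (name, price) pairs pulled
    # out of each //section element on the page.
    for name, price in sections:
        # No varName1, varName2, ...: each yielded item is handed
        # to the engine one at a time, so no names are needed.
        yield {'name': name, 'price': price}

scraped = list(parse([('Desktop PC', '1000'), ('Laptop', '1500')]))
print(scraped)
# [{'name': 'Desktop PC', 'price': '1000'}, {'name': 'Laptop', 'price': '1500'}]
```

An instance that is created but neither yielded nor bound to a name (as in the last snippet of the question) is simply garbage-collected; it is not stored anywhere and cannot be accessed afterwards.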