Scrapy: how to use items in a spider and how to send items to pipelines?
Question
I am new to Scrapy, and my task is simple.
For a given e-commerce website:
- crawl all website pages
- look for product pages
- if the URL points to a product page:
  - create an item
  - process the item to store it in a database
I created the spider, but the products are just printed to a simple file.
My question is about the project structure: how do I use items in a spider, and how do I send items to pipelines?
I can't find a simple example of a project that uses items and pipelines.
Recommended answer
- How to use items in my spider?
Well, the main purpose of items is to store the data you crawled. scrapy.Item objects are basically dictionaries. To declare your items, you will have to create a class and add scrapy.Field attributes to it:
import scrapy

class Product(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
You can now use it in your spider by importing your Product class.
For more advanced usage, see the Scrapy documentation on Items.
- How to send items to the pipeline?
First, you need to tell Scrapy to use your custom pipeline by enabling it in the project settings.
In the settings.py file:
ITEM_PIPELINES = {
    'myproject.pipelines.CustomPipeline': 300,
}
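The integer next to each pipeline sets the order in which items pass through the enabled pipelines: lower values run first, and values are conventionally kept in the 0-1000 range. As a minimal sketch, assuming a project with two pipelines (ValidationPipeline and DatabasePipeline are hypothetical names, not part of the original answer), the setting could look like this:

```python
# settings.py (sketch): both pipeline class names below are hypothetical
ITEM_PIPELINES = {
    "myproject.pipelines.ValidationPipeline": 100,  # lower number, runs first
    "myproject.pipelines.DatabasePipeline": 300,    # higher number, runs second
}
```

With this layout, an item would only reach the database pipeline after the validation pipeline has returned it.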
You can now write your pipeline and play with your item.
In the pipelines.py file (the module referenced by the setting above):
from scrapy.exceptions import DropItem

class CustomPipeline:
    def __init__(self):
        # Create your database connection here
        self.connection = None

    def process_item(self, item, spider):
        # Here you can index or store your item;
        # raise DropItem("reason") to discard it instead
        return item
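Since the goal in the question is to store products in a database, here is one possible sketch of such a pipeline using SQLite from the Python standard library. The class name SQLitePipeline, the products table, and the products.db path are all assumptions made for illustration, and the sketch assumes each item carries plain-string url and title values. It relies on the optional open_spider and close_spider hooks that Scrapy calls on a pipeline when a spider starts and finishes.

```python
import sqlite3


class SQLitePipeline:
    """Store each scraped item in a local SQLite database (illustrative sketch)."""

    def __init__(self, db_path="products.db"):
        # db_path is a hypothetical default; ":memory:" gives a throwaway database
        self.db_path = db_path
        self.connection = None

    def open_spider(self, spider):
        # Called when the spider starts: open the connection once, create the table
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS products (url TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        # Called when the spider finishes: save changes and release the connection
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        # Insert the item, or overwrite the row if this URL was already stored
        self.connection.execute(
            "INSERT OR REPLACE INTO products (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        return item  # return the item so any later pipeline still receives it
```

To try it, the class would need to be listed in ITEM_PIPELINES in place of (or next to) CustomPipeline. Using the URL as the primary key is a design choice that keeps one row per page if the same product is crawled twice.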
Finally, in your spider, you need to yield your item once it is filled.
Example spider.py:
import scrapy
from myproject.items import Product

class MySpider(scrapy.Spider):
    name = "test"
    start_urls = [
        'http://www.exemple.com',
    ]

    def parse(self, response):
        doc = Product()
        doc['url'] = response.url
        # .get() returns the first matching text node as a string (or None)
        doc['title'] = response.xpath('//div/p/text()').get()
        yield doc  # Will go to your pipeline
Hope this helps. Here is the documentation for pipelines: Item Pipeline