Scrapy Return Multiple Items
Question
I'm new to Scrapy and I'm lost on how to return multiple items from a single callback.
Basically, each quote on the page sits in one HTML element that contains nested tags for the quote text, the author name, and some tags (keywords) describing the quote.
The code here returns only one quote and that's it. It doesn't use the loop to return the rest. I've been searching the web for hours and I still don't get it. Here's my code so far:
Spider.py
import scrapy
from scrapy.loader import ItemLoader
from first_spider.items import FirstSpiderItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        l = ItemLoader(item=FirstSpiderItem(), response=response)

        quotes = response.xpath("//*[@class='quote']")

        for quote in quotes:
            text = quote.xpath(".//span[@class='text']/text()").extract_first()
            author = quote.xpath(".//small[@class='author']/text()").extract_first()
            tags = quote.xpath(".//meta[@class='keywords']/@content").extract_first()

            # removes quotation marks from the text
            for c in ['“', '”']:
                if c in text:
                    text = text.replace(c, "")

            l.add_value('text', text)
            l.add_value('author', author)
            l.add_value('tags', tags)
            return l.load_item()

        next_page_path = response.xpath(".//li[@class='next']/a/@href").extract_first()

        next_page_url = response.urljoin(next_page_path)
        yield scrapy.Request(next_page_url)
Items.py
import scrapy


class FirstSpiderItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
This is the page I want to scrape (the start URL in the spider above): http://quotes.toscrape.com/
Recommended answer
I was also looking for a solution to the same problem. Here is what I found: create one ItemLoader per quote element and yield each loaded item instead of returning it.
def parse(self, response):
    # One loader per quote element, so each item holds only its own values.
    for selector in response.xpath("//*[@class='quote']"):
        l = ItemLoader(item=FirstSpiderItem(), selector=selector)
        l.add_xpath('text', './/span[@class="text"]/text()')
        # The leading dot keeps the lookup relative to the current quote.
        l.add_xpath('author', './/small[@class="author"]/text()')
        l.add_xpath('tags', './/meta[@class="keywords"]/@content')
        yield l.load_item()

    # Follow the pagination link, if the page has one.
    next_page = response.xpath(".//li[@class='next']/a/@href").extract_first()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)
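As far as I can tell, the spider in the question stops early because of the `return` inside the loop: the callback is finished the first time that line runs, so the remaining quotes are never processed. On top of that, because the callback also contains a `yield`, Python treats it as a generator, and a `return` with a value in a generator only ends the iteration. The sketch below shows this in plain Python with no Scrapy involved; `parse_with_return`, `parse_with_yield`, and the sample strings are made-up names for illustration, not part of the original spider.

```python
def parse_with_return(quotes):
    """Mimics the shape of the question's callback: `return` inside the loop."""
    for quote in quotes:
        # In a generator, `return` ends the iteration on the first pass;
        # the returned value is not handed to whoever iterates the result.
        return {"text": quote}
    # This `yield` (the pagination request in the question) is what turns
    # the whole function into a generator.
    yield "next-page-request"


def parse_with_yield(quotes):
    """Mimics the shape of the answer's callback: `yield` for every quote."""
    for quote in quotes:
        yield {"text": quote}
    yield "next-page-request"


quotes = ["first", "second", "third"]
returned = list(parse_with_return(quotes))
yielded = list(parse_with_yield(quotes))

print(returned)      # nothing is produced at all
print(len(yielded))  # three items plus the pagination request
```

Scrapy iterates over whatever the callback produces, so yielding each item is what lets all quotes (and the next-page request) reach the engine.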
To remove the quotation marks from the text, you can use an output processor in items.py.
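import scrapy
# On recent Scrapy releases the same processors are importable from
# itemloaders.processors.
from scrapy.loader.processors import MapCompose


def replace_quotes(text):
    # Strip the curly quotation marks that wrap each quote.
    for c in ['“', '”']:
        if c in text:
            text = text.replace(c, "")
    return text


class FirstSpiderItem(scrapy.Item):
    # The processor belongs on the field that holds the quote text.
    text = scrapy.Field(output_processor=MapCompose(replace_quotes))
    author = scrapy.Field()
    tags = scrapy.Field()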
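If you want to check the helper outside of a crawl, a minimal sketch like the one below may be enough. It assumes the site wraps quote text in curly quotation marks (U+201C and U+201D). `apply_to_each` is a hypothetical stand-in that only roughly imitates how `MapCompose` calls a single function on every extracted value, and the sample strings are invented.

```python
def replace_quotes(text):
    """Remove curly quotation marks (U+201C, U+201D) from a string."""
    for c in ("\u201c", "\u201d"):
        text = text.replace(c, "")
    return text


def apply_to_each(func, values):
    """Hypothetical stand-in for MapCompose with one function:
    call `func` on every value and collect the results in a list."""
    return [func(value) for value in values]


# Loaders collect extracted values in a list, so the processor sees a list.
raw_values = ["\u201cA made-up sample quote.\u201d", "No quotes here"]
cleaned = apply_to_each(replace_quotes, raw_values)
print(cleaned)
```

Strings without the curly marks pass through unchanged, so the processor is safe to apply to every value of the field.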
Please let me know whether this helps.