Scrapy: Extract links and text
Problem description
I am new to Scrapy and I am trying to scrape a page on the Ikea website. The base page with the list of locations is given here.
My items.py file is as follows:
import scrapy
class IkeaItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
The spider is as follows:
import scrapy
from ikea.items import IkeaItem
class IkeaSpider(scrapy.Spider):
    name = 'ikea'
    allowed_domains = ['http://www.ikea.com/']
    start_urls = ['http://www.ikea.com/']
    def parse(self, response):
        for sel in response.xpath('//tr/td/a'):
            item = IkeaItem()
            item['name'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            yield item
When I run the spider I do not get any useful output. The JSON output file looks something like this:
[[{"link": [], "name": []}
The output I am looking for is the name of each location and its link, but I am getting nothing. Where am I going wrong?
Recommended answer
There is a simple mistake in the XPath expressions for the item fields. The loop is already iterating over the a tags, so you don't need to specify a in the inner XPath expressions. In other words, you are currently searching for a tags inside the a tags inside td inside tr, which matches nothing because the links contain no nested a elements.
Replace a/text() with text() and a/@href with @href.
(tested - works for me)