Scrapy cannot scrape a second page using ItemLoader
Question
Update: 7/29, 9:29pm: After reading this post, I updated my code.
UPDATE: 7/28/15, 7:35pm: Following Martin's suggestion, the message changed, but there is still no listing of items and nothing is written to the database.
ORIGINAL: I can successfully scrape a single page (the base page). Now I am trying to scrape one of the items from another URL found on the "base" page, using Request with a callback. But it does not work. The spider is here:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy import Request
import re
from datetime import datetime, timedelta
from CAPjobs.items import CAPjobsItem
from CAPjobs.items import CAPjobsItemLoader
from scrapy.contrib.loader.processor import MapCompose, Join

class CAPjobSpider(Spider):
    name = "naturejob3"
    download_delay = 2
    #allowed_domains = ["nature.com/naturejobs/"]
    start_urls = [
        "http://www.nature.com/naturejobs/science/jobs?utf8=%E2%9C%93&q=pathologist&where=&commit=Find+Jobs"]

    def parse_subpage(self, response):
        il = response.meta['il']
        il.add_xpath('loc_pj', '//div[@id="extranav"]/div/dl/dd[2]/ul/li/text()')
        yield il.load_item()

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//div[@class="job-details"]')
        for site in sites:
            il = CAPjobsItemLoader(CAPjobsItem(), selector = site)
            il.add_xpath('title', 'h3/a/text()')
            il.add_xpath('post_date', 'normalize-space(ul/li[@class="when"]/text())')
            il.add_xpath('web_url', 'concat("http://www.nature.com", h3/a/@href)')
            url = il.get_output_value('web_url')
            yield Request(url, meta={'il': il}, callback=self.parse_subpage)
Now the scraping partially works, but there is no loc_pj item (UPDATE on 7/29, 7:35pm):
2015-07-29 21:28:24 [scrapy] DEBUG: Scraped from <200 http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000>
{'post_date': u'21 days ago',
'title': u'Assistant, Associate, Full (HS Clinical, Clin X) - Anatomic Pathology/Cytopathology (11-000)',
'web_url': u'http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000'}
Recommended answer
You initialize the ItemLoader like so:
il = CAPjobsItemLoader(CAPjobsItem, sites)
In the documentation it is done like this:
l = ItemLoader(item=Product(), response=response)
So I think you're missing the parentheses after CAPjobsItem, and your line should read:
il = CAPjobsItemLoader(CAPjobsItem(), sites)
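To illustrate why the parentheses matter, here is a minimal sketch in plain Python rather than Scrapy itself. JobItem and fill are hypothetical stand-ins (a dict subclass and a helper that stores one field), not part of the original spider or of Scrapy's API; the exact behaviour or error inside Scrapy may differ by version.

```python
# Hypothetical stand-in for a Scrapy Item: a plain dict subclass.
class JobItem(dict):
    pass


def fill(item):
    """Store one field on whatever object was passed in."""
    item['title'] = 'Pathologist'
    return item


# Passing the class itself (parentheses missing): a class object
# does not support item assignment, so this raises TypeError.
try:
    fill(JobItem)
    class_result = 'ok'
except TypeError:
    class_result = 'TypeError'

# Passing an instance (with parentheses): the field is stored normally.
instance_result = fill(JobItem())
```

The point is only that CAPjobsItem is the class and CAPjobsItem() is a fresh object that can hold field values, which is what the loader needs to populate.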