Scrapy can not scrape a second page using itemloader


Problem description

Update: 7/29, 9:29pm: After reading this post, I updated my code.

UPDATE: 7/28/15, at 7:35pm, following Martin's suggestion, the message changed, but still no listing of items or writing to database.

ORIGINAL: I can successfully scrape a single page (the base page). Now I am trying to scrape one of the items from another URL found on the "base" page, using a Request with a callback. But it does not work. The spider is here:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy import Request
import re
from datetime import datetime, timedelta
from CAPjobs.items import CAPjobsItem 
from CAPjobs.items import CAPjobsItemLoader
from scrapy.contrib.loader.processor import MapCompose, Join

class CAPjobSpider(Spider):
    name = "naturejob3"
    download_delay = 2
    #allowed_domains = ["nature.com/naturejobs/"]
    start_urls = [
    "http://www.nature.com/naturejobs/science/jobs?utf8=%E2%9C%93&q=pathologist&where=&commit=Find+Jobs"]

    def parse_subpage(self, response):
        il = response.meta['il']
        il.add_xpath('loc_pj', '//div[@id="extranav"]/div/dl/dd[2]/ul/li/text()')  
        yield il.load_item()

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//div[@class="job-details"]')    

        for site in sites:

            il = CAPjobsItemLoader(CAPjobsItem(), selector = site) 
            il.add_xpath('title', 'h3/a/text()')
            il.add_xpath('post_date', 'normalize-space(ul/li[@class="when"]/text())')
            il.add_xpath('web_url', 'concat("http://www.nature.com", h3/a/@href)')
            url = il.get_output_value('web_url')
            yield Request(url, meta={'il': il}, callback=self.parse_subpage)

Now the scraping is partially functioning, but there is no loc_pj item: (UPDATE on 7/29, 7:35pm)

2015-07-29 21:28:24 [scrapy] DEBUG: Scraped from <200 http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000>
{'post_date': u'21 days ago',
'title': u'Assistant, Associate, Full (HS Clinical, Clin X) - Anatomic Pathology/Cytopathology (11-000)',
'web_url': u'http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000'}

Answer

You initialize the ItemLoader like so:

il = CAPjobsItemLoader(CAPjobsItem, sites)

In the documentation it is done like this:

l = ItemLoader(item=Product(), response=response)

So I think you're missing the parentheses after CAPjobsItem, and your line should read:

il = CAPjobsItemLoader(CAPjobsItem(), sites)
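
To see why the parentheses matter, here is a minimal stand-alone sketch. MiniLoader and this pared-down CAPjobsItem are hypothetical stand-ins for Scrapy's ItemLoader and Item (not the real implementations); they only mimic the relevant behavior: the loader expects an item *instance* it can store fields into, so passing the bare class fails as soon as a field is written.

```python
class CAPjobsItem(dict):
    """Stand-in for a Scrapy Item: a dict-like container of fields."""

class MiniLoader:
    """Hypothetical stand-in for ItemLoader: writes values into the item."""
    def __init__(self, item):
        self.item = item

    def add_value(self, field, value):
        self.item[field] = value

    def load_item(self):
        return self.item

# Passing the class object breaks: a class does not support item assignment.
try:
    bad = MiniLoader(CAPjobsItem)        # class, no parentheses
    bad.add_value('title', 'Pathologist')
except TypeError as e:
    print('fails:', e)

# Passing an instance works as intended.
good = MiniLoader(CAPjobsItem())         # instance, with parentheses
good.add_value('title', 'Pathologist')
print(good.load_item())                  # {'title': 'Pathologist'}
```

The same distinction applies in the spider: `CAPjobsItemLoader(CAPjobsItem(), ...)` hands the loader a fresh item to fill, while `CAPjobsItemLoader(CAPjobsItem, ...)` hands it the class itself.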

