Scrapy can not scrape a second page using itemloader


Problem description

Update: 7/29, 9:29pm: After reading this post, I updated my code.

UPDATE: 7/28/15, at 7:35pm, following Martin's suggestion, the message changed, but still no listing of items or writing to database.

ORIGINAL: I can successfully scrape a single page (the base page). Now I am trying to scrape one of the items from another URL found on the "base" page, using a Request with a callback. But it does not work. The spider is here:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy import Request
import re
from datetime import datetime, timedelta
from CAPjobs.items import CAPjobsItem 
from CAPjobs.items import CAPjobsItemLoader
from scrapy.contrib.loader.processor import MapCompose, Join

class CAPjobSpider(Spider):
    name = "naturejob3"
    download_delay = 2
    #allowed_domains = ["nature.com/naturejobs/"]
    start_urls = [
    "http://www.nature.com/naturejobs/science/jobs?utf8=%E2%9C%93&q=pathologist&where=&commit=Find+Jobs"]

    def parse_subpage(self, response):
        il = response.meta['il']
        il.add_xpath('loc_pj', '//div[@id="extranav"]/div/dl/dd[2]/ul/li/text()')  
        yield il.load_item()

    def parse(self, response):
        hxs = Selector(response)
        sites = hxs.xpath('//div[@class="job-details"]')    

        for site in sites:

            il = CAPjobsItemLoader(CAPjobsItem(), selector = site) 
            il.add_xpath('title', 'h3/a/text()')
            il.add_xpath('post_date', 'normalize-space(ul/li[@class="when"]/text())')
            il.add_xpath('web_url', 'concat("http://www.nature.com", h3/a/@href)')
            url = il.get_output_value('web_url')
            yield Request(url, meta={'il': il}, callback=self.parse_subpage)

Now the scraping is partially functioning, but there is no loc_pj item: (UPDATE on 7/29, 7:35pm)

2015-07-29 21:28:24 [scrapy] DEBUG: Scraped from <200 http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000>
{'post_date': u'21 days ago',
'title': u'Assistant, Associate, Full (HS Clinical, Clin X) - Anatomic Pathology/Cytopathology (11-000)',
'web_url': u'http://www.nature.com/naturejobs/science/jobs/535683-assistant-associate-full-hs-clinical-clin-x-anatomic-pathology-cytopathology-11-000'}

Answer

You initialize the ItemLoader like so:

il = CAPjobsItemLoader(CAPjobsItem, sites)

In the documentation it is done like this:

l = ItemLoader(item=Product(), response=response)

So I think you're missing the parentheses after CAPjobsItem, and your line should read:

il = CAPjobsItemLoader(CAPjobsItem(), sites)
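
To see why the parentheses matter, here is a minimal stand-alone sketch. MiniLoader and this pared-down CAPjobsItem are hypothetical stand-ins for Scrapy's ItemLoader and Item (not the real implementations); they only mimic the relevant behavior: the loader expects an item *instance* it can store fields into, so passing the bare class fails as soon as a field is written.

```python
class CAPjobsItem(dict):
    """Stand-in for a Scrapy Item: a dict-like container of fields."""

class MiniLoader:
    """Hypothetical stand-in for ItemLoader: writes values into the item."""
    def __init__(self, item):
        self.item = item

    def add_value(self, field, value):
        self.item[field] = value

    def load_item(self):
        return self.item

# Passing the class object breaks: a class does not support item assignment.
try:
    bad = MiniLoader(CAPjobsItem)        # class, no parentheses
    bad.add_value('title', 'Pathologist')
except TypeError as e:
    print('fails:', e)

# Passing an instance works as intended.
good = MiniLoader(CAPjobsItem())         # instance, with parentheses
good.add_value('title', 'Pathologist')
print(good.load_item())                  # {'title': 'Pathologist'}
```

The same distinction applies in the spider: `CAPjobsItemLoader(CAPjobsItem(), ...)` hands the loader a fresh item to fill, while `CAPjobsItemLoader(CAPjobsItem, ...)` hands it the class itself.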

