Scraping a web page containing the anchor tag <a href="#"> using Scrapy

Views: 58 Published: 2021/7/16 22:19:16 Tags: javascript python web-scraping scrapy scrapy-splash


Problem description

I am scraping Manulife.

I want to go to the next page. When I inspect the "Next" link, I get:

<span class="pagerlink">
    <a href="#" id="next" title="Go to the next page">Next</a>
</span>

What would be the right approach to follow?

# -*- coding: utf-8 -*-
import scrapy
import json
from scrapy_splash import SplashRequest

class Manulife(scrapy.Spider):
    name = 'manulife'
    #allowed_domains = ['manulife.taleo.net']  # domain names only, not full URLs
    start_urls = ['https://manulife.taleo.net/careersection/external_global/jobsearch.ftl?lang=en&location=1038']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                args={'wait': 5},
            )

    def parse(self, response):
        #yield {
        #   'demo' : response.css('div.absolute > span > a::text').extract()
        #     }
        urls = response.css('div.absolute > span > a::attr(href)').extract() 
        for url in urls:
            url = "https://manulife.taleo.net" + url
            yield SplashRequest(url = url, callback = self.parse_details, args={'wait': 5})
            #self.log("reached 22 : "+ url)

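        # Following the "Next" link: its href is "#", so there is no URL to
        # extract from the markup; this part is still unsolved.
        #data = json.loads(response.text)
        #self.log("reached 22 : "+ data)
        next_page_url = None  # placeholder so the check below does not raise

        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield SplashRequest(url = next_page_url, callback = self.parse, args={'wait': 5})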

    def parse_details(self,response):
        yield {
           'Job post' : response.css('div.contentlinepanel > span.titlepage::text').extract(),
           'Location' : response.xpath("//span[@id = 'requisitionDescriptionInterface.ID1679.row1']/text()").extract(),
           'Organization' : response.xpath("//span[@id = 'requisitionDescriptionInterface.ID1787.row1']/text()").extract(),
           'Date posted' : response.xpath("//span[@id = 'requisitionDescriptionInterface.reqPostingDate.row1']/text()").extract(),
           'Industry': response.xpath("//span[@id = 'requisitionDescriptionInterface.ID1951.row1']/text()").extract()
          }

As you can see, the code issues a SplashRequest when following the next-page link.
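
One reason the pagination step stalls can be checked without running the spider: resolving the "Next" link's `href="#"` against the page URL only gives the same page again, so `response.urljoin` has nothing new to request. A minimal sketch using only the standard library follows; the helper name `resolves_to_same_page` and the path `/some/other/page` are made up for illustration.

```python
from urllib.parse import urldefrag, urljoin

# Start URL taken from the spider above.
BASE_URL = (
    "https://manulife.taleo.net/careersection/external_global/"
    "jobsearch.ftl?lang=en&location=1038"
)


def resolves_to_same_page(base_url: str, href: str) -> bool:
    """Return True if href, resolved against base_url, points at base_url itself.

    Fragments are ignored because they are never sent to the server.
    """
    target = urljoin(base_url, href)
    return urldefrag(target).url == urldefrag(base_url).url


if __name__ == "__main__":
    # The "Next" link from the question.
    print(resolves_to_same_page(BASE_URL, "#"))
    # A hypothetical ordinary link, for comparison.
    print(resolves_to_same_page(BASE_URL, "/some/other/page"))
```

If the check is true for the link, the next page is presumably loaded by JavaScript attached to the `#next` element, so it would have to be triggered inside the browser that Splash drives, or the underlying request reproduced, rather than followed as a URL.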

I am new to scraping. I read somewhere that the website can also return the response as JSON. I tried that, but it gives me the error "No JSON object could be decoded".
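
"No JSON object could be decoded" is the `ValueError` message that `json.loads` raises in Python 2 when the text is not JSON, which is what happens if the response body is rendered HTML. A small sketch of a guard, assuming the body is available as a string; `try_parse_json` is a hypothetical helper, not part of Scrapy, and the `"jobs"` key is invented for the example.

```python
import json
from typing import Any, Optional


def try_parse_json(text: str) -> Optional[Any]:
    """Return the decoded JSON value, or None if text is not valid JSON.

    Note: a literal JSON null also comes back as None.
    """
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError subclasses ValueError in Python 3
        return None


if __name__ == "__main__":
    html_body = '<span class="pagerlink"><a href="#" id="next">Next</a></span>'
    json_body = '{"jobs": []}'  # invented payload, for illustration only
    print(try_parse_json(html_body))
    print(try_parse_json(json_body))
```

Logging the first few hundred characters of `response.text` when the guard returns None is a quick way to see whether the server sent HTML instead of JSON.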
