Scrapy is not Crawling the next page url


Problem Description

My spider is not crawling page 2, but the XPath is returning the correct next-page link, which is an absolute link to the next page.

Here is my code:

from scrapy import Spider
from scrapy.http import Request, FormRequest


class MintSpiderSpider(Spider):
    name = 'Mint_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        urls = response.xpath('//div[@class = "post-inner post-hover"]/h2/a/@href').extract()

        for url in urls:
            yield Request(url, callback=self.parse_lyrics)

        next_page_url = response.xpath('//li[@class="next right"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(next_page_url, callback=self.parse)

    def parse_foo(self, response):
        info = response.xpath('//*[@class="songinfo"]/p/text()').extract()
        name = response.xpath('//*[@id="lyric"]/h2/text()').extract()

        yield {
            'name': name,
            'info': info
        }

Answer

The problem is that next_page_url is a list, and it needs to be a URL as a string. You need to use the extract_first() function instead of extract() in next_page_url = response.xpath('//li[@class="next right"]/a/@href').extract().
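For illustration, a minimal sketch of the difference between the two calls, assuming the same markup as in the question:

# extract() returns a list of every match, e.g. ['http://www.example.com/page/2/'];
# passing a list as the url argument of Request() raises a TypeError
next_page_url = response.xpath('//li[@class="next right"]/a/@href').extract()

# extract_first() returns the first match as a string, or None if nothing matched,
# which is what Request(url) expects
next_page_url = response.xpath('//li[@class="next right"]/a/@href').extract_first()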

Update

You have to import scrapy, since you are using yield scrapy.Request(next_page_url, callback=self.parse).
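Putting both fixes together, a minimal sketch of the working spider might look like the following. Note that parse_foo as the callback is an assumption here, since the question's code yields to a parse_lyrics method that is never defined:

import scrapy
from scrapy import Spider
from scrapy.http import Request


class MintSpiderSpider(Spider):
    name = 'Mint_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        urls = response.xpath('//div[@class = "post-inner post-hover"]/h2/a/@href').extract()

        for url in urls:
            # assuming parse_foo below is the intended callback; the question's
            # parse_lyrics is not defined anywhere in the spider
            yield Request(url, callback=self.parse_foo)

        # extract_first() returns a single URL string (or None), not a list
        next_page_url = response.xpath('//li[@class="next right"]/a/@href').extract_first()
        if next_page_url:
            # scrapy.Request works here because scrapy is now imported
            yield scrapy.Request(next_page_url, callback=self.parse)

    def parse_foo(self, response):
        info = response.xpath('//*[@class="songinfo"]/p/text()').extract()
        name = response.xpath('//*[@id="lyric"]/h2/text()').extract()

        yield {
            'name': name,
            'info': info
        }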
