Scrapy - ValueError: Missing scheme in request url: #mw-head


Problem Description

I'm getting the following traceback but unsure how to refactor.

ValueError: Missing scheme in request url: #mw-head
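For context: "#mw-head" is a fragment-only relative link, so it carries no http:// or https:// scheme, which is exactly what the ValueError complains about. A minimal standard-library sketch (the page URL here is illustrative) of how such a link is normally resolved against the page it came from:

```python
from urllib.parse import urljoin, urlparse

page_url = "https://en.wikipedia.org/wiki/Example"  # illustrative page URL
href = "#mw-head"  # a fragment-only href scraped from the page

# On its own the href has no scheme, which is what Scrapy rejects.
assert urlparse(href).scheme == ""

# Resolving it against the page URL yields an absolute, schemed URL.
absolute = urljoin(page_url, href)
print(absolute)  # -> https://en.wikipedia.org/wiki/Example#mw-head
```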

Full code:

class MissleSpiderBio(scrapy.Spider): 

    name = 'missle_spider_bio'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/...']

This is the part giving me issues (I believe):

    def parse(self, response):
        filename = response.url.split('/')[-1]
        table = response.xpath('///div/table[2]/tbody')
        rows = table.xpath('//tr')
        row = rows[2]
        row.xpath('td//text()')[0].extract()
        wdata = {}
        for row in response.xpath('//* \
        [@class="wikitable"]//tbody//tr'):
            for link in response.xpath('//a/@href'):
                link = link.extract()
                if((link.strip() != '')):
                    yield Request(link, callback=self.parse)
                    #wdata.append(link)
                else:
                    yield None
                #wdata = {}
                #wdata['link'] = BASE_URL + 
                #row.xpath('a/@href').extract() #[0]
                wdata['link'] = BASE_URL + link 
                request = scrapy.Request(wdata['link'],\
                callback=self.get_mini_bio, dont_filter=True) 
                request.meta['item'] = MissleItem(**wdata)
                yield request

Here is the second part of the code:

    def get_mini_bio(self, response):
        BASE_URL_ESCAPED = 'http:\/\/en.wikipedia.org'
        item = response.meta['item']
        item['image_urls'] = []
        img_src = response.xpath(
            '//table[contains(@class, "infobox")]//img/@src')
        if img_src:
            item['image_urls'] = ['http:' + img_src[0].extract()]
        mini_bio = ''
        paras = response.xpath(
            '//*[@id="mw-content-text"]/p[text() or normalize-space(.)=""]'
        ).extract()
        for p in paras:
            if p == '<p></p>':
                break
            mini_bio += p

        mini_bio = mini_bio.replace('href="/wiki',
                                    'href="' + BASE_URL + '/wiki')
        mini_bio = mini_bio.replace('href="#', item['link'] + '#')
        item['mini_bio'] = mini_bio
        yield item

I tried refactoring but am now getting a:

ValueError: Missing scheme in request url: #mw-head

Any help is much appreciated.

Recommended Answer

row.xpath('a/@href').extract()

That expression evaluates to a list, not a string. When you pass the URL to the Request object, Scrapy expects a string, not a list.

To fix this, you have a few options: You can use LinkExtractors, which will allow you to search a page for links and automatically create Scrapy Request objects for those links:

https://doc.scrapy.org/en/latest/topics/link-extractors.html

Or you could run a for loop and go through each of the links:

from scrapy import Request

for link in response.xpath('//a/@href'):
    link = link.extract()
    if link.strip() != '':
        # resolve relative URLs such as "#mw-head" against the current page
        yield Request(response.urljoin(link), callback=self.parse)

You can add whatever string filters you want to that code.

If you just want the first link, you can use .extract_first() instead of .extract().
