Scrapy: ValueError('Missing scheme in request url: %s' % self._url)

Problem Description

I am trying to scrape data from a webpage. The webpage is simply a bulleted list of 2500 URLs. Scrapy fetches each URL in turn and extracts some data ...

Here is my code:

from scrapy import Request, Selector
from scrapy.spiders import CrawlSpider
from bs4 import BeautifulSoup

# NewsFields is the project's Item class; its import is not shown in the question

class MySpider(CrawlSpider):
    name = 'dknews'
    start_urls = ['http://www.example.org/uat-area/scrapy/all-news-listing']
    allowed_domains = ['example.org']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        # collect the dkpage* metadata elements
        ptype = soup.find_all(attrs={"name": "dkpagetype"})
        ptitle = soup.find_all(attrs={"name": "dkpagetitle"})
        pturl = soup.find_all(attrs={"name": "dkpageurl"})
        ptdate = soup.find_all(attrs={"name": "dkpagedate"})
        ptdesc = soup.find_all(attrs={"name": "dkpagedescription"})
        for node in soup.find_all("div", class_="module_content-panel-sidebar-content"):
            # collapse the node's text into one whitespace-normalised string
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
        yield nf
        # follow every link in the listing
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(url, callback=self.parse)

Now the problem is that the above code scrapes around 215 of the 2500 articles. It then closes with this error ...

ValueError('Missing scheme in request url: %s' % self._url)

I have no idea what is causing this error ....

Any help is greatly appreciated.

Thanks

Recommended Answer

Update 01/2019

Nowadays Scrapy's Response instances have the very convenient method response.follow, which generates a Request from a given URL (absolute, relative, or even a Link object produced by a LinkExtractor), using response.url as the base:

yield response.follow('some/url', callback=self.parse_some_url, headers=headers, ...)

Docs: http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response.follow
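
Applied to the loop from the question, a minimal sketch (the XPath is copied from the question; only the Request construction changes):

for href in response.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
    # follow() resolves a relative href against response.url before
    # building the Request, so 'Missing scheme' cannot be raised
    yield response.follow(href, callback=self.parse)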

The code below looks like the source of the issue:

for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
    yield Request(url, callback=self.parse)

If any of the URLs is not fully qualified, e.g. it looks like href="/path/to/page" rather than href="http://example.com/path/to/page", you'll get this error. To make sure you yield correct requests, you can use response.urljoin:

    yield Request(response.urljoin(url), callback=self.parse)
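
For illustration, a quick sketch of how response.urljoin resolves URLs, assuming the page being parsed is http://example.com/news/index.html:

response.urljoin('/path/to/page')       # -> 'http://example.com/path/to/page'
response.urljoin('page2.html')          # -> 'http://example.com/news/page2.html'
response.urljoin('http://other.org/x')  # absolute URLs come back unchanged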

The Scrapy way, though, is to use a LinkExtractor: https://doc.scrapy.org/en/latest/topics/link-extractors.html
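
A minimal sketch of that approach, assuming the listing keeps its links under the same ul[@class="scrapy"] as in the question (parse_news is a hypothetical callback name):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'dknews'
    allowed_domains = ['example.org']
    start_urls = ['http://www.example.org/uat-area/scrapy/all-news-listing']

    # LinkExtractor returns absolute URLs, so the
    # 'Missing scheme in request url' error cannot occur
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//ul[@class="scrapy"]'),
             callback='parse_news'),
    )

    def parse_news(self, response):
        # the item extraction from the question would go here
        ...

Note that a CrawlSpider must not override parse (the rules machinery depends on it), which is one more reason to move the extraction into a separate callback as above.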
