Scrapy: ValueError('Missing scheme in request url: %s' % self._url)
Question

I am trying to scrape data from a webpage. The webpage is simply a bulleted list of 2500 URLs. Scrapy fetches each URL and extracts some data ...

Here is my code:
from scrapy import Request
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
from bs4 import BeautifulSoup
# NewsFields is my Item subclass (its import is not shown here)

class MySpider(CrawlSpider):
    name = 'dknews'
    start_urls = ['http://www.example.org/uat-area/scrapy/all-news-listing']
    allowed_domains = ['example.org']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        ptype = soup.find_all(attrs={"name": "dkpagetype"})
        ptitle = soup.find_all(attrs={"name": "dkpagetitle"})
        pturl = soup.find_all(attrs={"name": "dkpageurl"})
        ptdate = soup.find_all(attrs={"name": "dkpagedate"})
        ptdesc = soup.find_all(attrs={"name": "dkpagedescription"})
        for node in soup.find_all("div", class_="module_content-panel-sidebar-content"):
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
        nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
        nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
        nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
        nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
        nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
        nf['bodytext'] = ptbody.encode('ascii', 'ignore')
        yield nf
        for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
            yield Request(url, callback=self.parse)
Now the problem is that the above code scrapes about 215 of the 2500 articles. It closes by giving this error ...

ValueError('Missing scheme in request url: %s' % self._url)
I have no idea what is causing this error ...

Any help is greatly appreciated.

Thanks
Answer
Update 01/2019

Nowadays Scrapy's Response instances have the very convenient method response.follow, which generates a Request from the given URL (either absolute or relative, or even a Link object produced by a LinkExtractor), using response.url as the base:
yield response.follow('some/url', callback=self.parse_some_url, headers=headers, ...)
Docs: http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response.follow
The code below looks like the culprit:

for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
    yield Request(url, callback=self.parse)
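Scrapy raises this ValueError when a request URL carries no scheme (no http:// or https:// prefix). A quick stdlib sketch of the same check, my illustration rather than Scrapy's actual code:

```python
from urllib.parse import urlparse

def check_url(url):
    # Mimic Scrapy's validation: a request URL must carry a scheme.
    if not urlparse(url).scheme:
        raise ValueError('Missing scheme in request url: %s' % url)
    return url

check_url('http://example.com/path/to/page')  # passes
# check_url('/path/to/page')  # would raise, just like Request('/path/to/page')
```

So any relative href taken straight from the page and passed to Request() triggers the error.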
If any of the URLs is not fully qualified, e.g. it looks like href="/path/to/page" rather than href="http://example.com/path/to/page", you'll get the error. To ensure you yield correct requests, you can use response.urljoin():
yield Request(response.urljoin(url), callback=self.parse)
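response.urljoin resolves a (possibly relative) URL against response.url, with the same semantics as the standard library's urljoin. The resolution itself can be sketched with urllib.parse alone (the base URL mirrors the question's listing page):

```python
from urllib.parse import urljoin

# What response.url would be inside the spider's parse() method.
base = 'http://www.example.org/uat-area/scrapy/all-news-listing'

# A relative href becomes a fully qualified URL with a scheme ...
relative = urljoin(base, '/path/to/page')

# ... while an already-absolute href passes through unchanged.
absolute = urljoin(base, 'http://example.com/path/to/page')

print(relative)  # http://www.example.org/path/to/page
print(absolute)  # http://example.com/path/to/page
```

Either way, the Request constructed from the result always has a scheme.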
The Scrapy way, though, is to use a LinkExtractor: https://doc.scrapy.org/en/latest/topics/link-extractors.html