Scrapy - ValueError: Missing scheme in request url: #mw-head
Question
I'm getting the following traceback but am unsure how to refactor:
ValueError: Missing scheme in request url: #mw-head
Full code:
class MissleSpiderBio(scrapy.Spider):
    name = 'missle_spider_bio'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/...']
This is the part giving me issues (I believe):
    def parse(self, response):
        filename = response.url.split('/')[-1]
        table = response.xpath('///div/table[2]/tbody')
        rows = table.xpath('//tr')
        row = rows[2]
        row.xpath('td//text()')[0].extract()
        wdata = {}
        for row in response.xpath('//*[@class="wikitable"]//tbody//tr'):
            for link in response.xpath('//a/@href'):
                link = link.extract()
                if link.strip() != '':
                    yield Request(link, callback=self.parse)
                    #wdata.append(link)
                else:
                    yield None
            #wdata = {}
            #wdata['link'] = BASE_URL +
            #row.xpath('a/@href').extract() #[0]
            wdata['link'] = BASE_URL + link
            request = scrapy.Request(wdata['link'],
                                     callback=self.get_mini_bio,
                                     dont_filter=True)
            request.meta['item'] = MissleItem(**wdata)
            yield request
Here is the second part of the code:
    def get_mini_bio(self, response):
        BASE_URL_ESCAPED = 'http:\/\/en.wikipedia.org'
        item = response.meta['item']
        item['image_urls'] = []
        img_src = response.xpath('//table[contains(@class, "infobox")]//img/@src')
        if img_src:
            item['image_urls'] = ['http:' + img_src[0].extract()]
        mini_bio = ''
        paras = response.xpath('//*[@id="mw-content-text"]/p[text() or normalize-space(.)=""]').extract()
        for p in paras:
            if p == '<p></p>':
                break
            mini_bio += p
        mini_bio = mini_bio.replace('href="/wiki', 'href="' + BASE_URL + '/wiki')
        mini_bio = mini_bio.replace('href="#', item['link'] + '#')
        item['mini_bio'] = mini_bio
        yield item
I tried refactoring, but am now getting:
ValueError: Missing scheme in request url: #mw-head
Any help is greatly appreciated.
Answer
row.xpath('a/@href').extract()
That expression evaluates to a list, NOT a string. When you pass the URL to the Request object, Scrapy expects a string, not a list.
To fix this, you have a few options. You can use LinkExtractors, which will allow you to search a page for links and automatically create Scrapy Request objects for those links:
https://doc.scrapy.org/en/latest/topics/link-extractors.html
OR you could run a for loop and go through each of the links:
from scrapy.spiders import Request
for link in response.xpath('//a/@href'):
    link = link.extract()
    if link.strip() != '':
        yield Request(link, callback=self.parse)
    else:
        yield None
You can add whatever string filters you want to that code.
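For context on why the error message mentions `#mw-head`: hrefs scraped from a page are often relative (`/wiki/...`) or fragment-only (`#mw-head`), and neither has a scheme, which is exactly what Request rejects. A minimal sketch with the standard library's urljoin (Scrapy's `response.urljoin()` does essentially this) shows how resolving against the page URL fixes both cases, and one possible string filter that skips empty and fragment-only links; the URLs are illustrative:

```python
from urllib.parse import urljoin

page = 'https://en.wikipedia.org/wiki/Missile'  # illustrative page URL

# A relative href gains the page's scheme and host
print(urljoin(page, '/wiki/Exocet'))  # https://en.wikipedia.org/wiki/Exocet

# A fragment-only href resolves to the same page plus the fragment
print(urljoin(page, '#mw-head'))      # https://en.wikipedia.org/wiki/Missile#mw-head

# One possible filter: drop empty and fragment-only hrefs, absolutize the rest
hrefs = ['/wiki/Exocet', '#mw-head', '', 'https://example.org/x']
kept = [urljoin(page, h) for h in hrefs
        if h.strip() and not h.startswith('#')]
```

In the loop above, `yield Request(response.urljoin(link), callback=self.parse)` would avoid the Missing scheme error entirely.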
OR, if you just want the first link, you can use .extract_first() instead of .extract().