How do I fix the Scrapy "Unsupported URL scheme" error?


Question

I collect a URL in my Python code and then insert it into start_urls:

from flask import Flask, jsonify, request
import scrapy
import subprocess

class ClassSpider(scrapy.Spider):
    name        = 'mySpider'
    #start_urls = []
    #pages      = 0
    news        = []

    def __init__(self, url, nbrPage):
        self.pages      = nbrPage
        self.start_urls = []
        self.start_urls.append(url)

    def parse(self, response):
        ...

    def run(self):
        subprocess.check_output(['scrapy', 'crawl', 'mySpider', '-a', f'url={self.start_urls}', '-a', f'nbrPage={self.pages}'])
        return self.news

app = Flask(__name__)
data = []

@app.route('/', methods=['POST'])
def getNews():
    mySpiderClass = ClassSpider(request.json['url'], 2)
    return jsonify({'data': mySpiderClass.run()})

if __name__ == "__main__":
    app.run(debug=True)

I got this error:

    raise NotSupported("Unsupported URL scheme '%s': %s" %
    scrapy.exceptions.NotSupported: Unsupported URL scheme '': no handler available for that scheme

When I add print('my urls List: ' + str(self.start_urls)), it prints the list of URLs, for example: my urls List: ['www.googole.com']

Any help, please?

Recommended answer

I guess this happens because you first append url to self.start_urls, and then your run method passes the whole list self.start_urls as the url argument. The spider started by scrapy crawl appends that value to its own start_urls, so you end up with a list nested inside a list (in practice the list's text form, such as "['www.googole.com']") instead of a list of plain URL strings.
To avoid this, you could change your __init__ method like this:

    def __init__(self, url, nbrPage):
        self.pages      = nbrPage
        self.url        = url
        self.start_urls = []
        self.start_urls.append(url)

Then pass self.url instead of self.start_urls in run:

    def run(self):
        subprocess.check_output(['scrapy', 'crawl', 'mySpider', '-a', f'url={self.url}', '-a', f'nbrPage={self.pages}'])
        return self.news
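
To see why the original call produces an empty scheme, here is a minimal sketch that uses only the standard library. It assumes the scheme is extracted the way urllib.parse does it, and the helper name scheme_seen_by_spider and the example.com URL are placeholders for illustration, not part of the original code. It compares the text that the original run() puts after url= with the text the corrected run() sends.

```python
from urllib.parse import urlparse


def scheme_seen_by_spider(cli_value: str) -> str:
    """Return the URL scheme parsed from a value passed as `-a url=...`.

    Hypothetical helper: it only mimics scheme extraction with urllib.parse.
    """
    return urlparse(cli_value).scheme


start_urls = ["http://www.example.com"]  # placeholder URL for the demo

# What the original run() sends: the text form of the whole list.
buggy_arg = f"{start_urls}"
# What the corrected run() sends: the plain URL string.
fixed_arg = f"{start_urls[0]}"

# Print each argument next to the scheme parsed from it.
print(repr(buggy_arg), "->", repr(scheme_seen_by_spider(buggy_arg)))
print(repr(fixed_arg), "->", repr(scheme_seen_by_spider(fixed_arg)))
```

The list form starts with "['", which is not a valid scheme prefix, so the parsed scheme is empty; that matches the '' in the error message. A bare host name such as 'www.googole.com' also parses to an empty scheme, so the URL posted to the Flask endpoint should probably include http:// or https:// as well.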
