How to give URL to scrapy for crawling?

Problem description

I want to use scrapy for crawling web pages. Is there a way to pass the start URL from the terminal itself?

It is given in the documentation that either the name of the spider or the URL can be passed, but when I give the URL it throws an error:

// The name of my spider is "example", but I am giving the URL instead of the spider name (it works fine if I give the spider name).

scrapy crawl example.com

The error:

文件"/usr/local/lib/python2.7/dist-packages/Scrapy-0.14.1-py2.7.egg/scrapy/spidermanager.py",第 43 行,在创建中raise KeyError("Spider not found: %s" % spider_name) KeyError: 'Spider not found: example.com'

File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.14.1-py2.7.egg/scrapy/spidermanager.py", line 43, in create raise KeyError("Spider not found: %s" % spider_name) KeyError: 'Spider not found: example.com'

How can I make scrapy use my spider on the URL given in the terminal?

Answer

I'm not really sure about the command-line option. However, you could write your spider like this:

from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Build start_urls from the start_url argument passed on the command line
        self.start_urls = [kwargs.get('start_url')]

Then start it like this: scrapy crawl my_spider -a start_url="http://some_url"
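As a side note, the answer above targets Scrapy 0.14; on newer Scrapy versions (1.0 and later) BaseSpider is gone, spiders subclass scrapy.Spider, and any -a argument is set on the spider as an instance attribute automatically. A minimal sketch of the same idea, assuming the same start_url argument name as above:

import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'

    def start_requests(self):
        # scrapy crawl my_spider -a start_url=... stores the value as self.start_url
        yield scrapy.Request(self.start_url, callback=self.parse)

    def parse(self, response):
        # Placeholder callback: just log the URL that was fetched
        self.logger.info('Crawled %s', response.url)

The command line stays the same: scrapy crawl my_spider -a start_url="http://some_url"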
