Passing Argument to Scrapy Spider from Python Script


Question

I am mentioning only some of the questions that I referred to before posting this question (I currently don't have links to all of them):

  • Question 1
  • Question 2

I am able to run this code completely if I don't pass the arguments and instead ask for input from the user inside the BBSpider class (without the main function, just below the name="dmoz" line), or if I provide them as pre-defined (i.e., static) arguments.

My code is here.

I am basically trying to execute a Scrapy spider from a Python script without requiring any additional files (not even the settings file). That is why I have specified the settings inside the code itself.

This is the output that I get on executing this script:

http://bigbasket.com/ps/?q=apple
2015-06-26 12:12:34 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:12:34 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:12:34 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:12:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:12:35 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:12:35 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:12:35 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:12:35 [scrapy] INFO: Spider opened
2015-06-26 12:12:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:12:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:12:35 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2015-06-26 12:12:35 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:12:35 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 342543),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 339158)}
2015-06-26 12:12:35 [scrapy] INFO: Spider closed (finished)

The problems that I am currently facing:

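  • If you look carefully at line 1 and line 6 of my output, the start_url that I passed to my spider is printed twice, even though I wrote the print statement only once, on line 31 of my code (linked above). Why does this happen, and why are the values different? The first print (line 1 of my output) shows the correct URL, but the second (line 6 of my output) shows None. The same happens with any print statement: even if I write print 'HI', it is printed twice. Why?
  • Next, look at this line of my output: TypeError: Request url must be str or unicode, got NoneType: Why does this occur, even though the questions I linked above do the same thing? I have no idea how to resolve it. I even tried `self.start_urls = [str(kwargs.get('start_url'))]`, which gives the following output: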
http://bigbasket.com/ps/?q=apple
2015-06-26 12:28:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:28:00 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:28:00 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:28:00 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:28:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:28:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:28:01 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:28:01 [scrapy] INFO: Spider opened
2015-06-26 12:28:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:28:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:28:01 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: None
2015-06-26 12:28:01 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:28:01 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 248350),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 236056)}
2015-06-26 12:28:01 [scrapy] INFO: Spider closed (finished)

Please help me resolve the above 2 errors.

Recommended Answer

You need to pass your parameters to the crawl method of the CrawlerProcess, so you need to run it like this:

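from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

crawler = CrawlerProcess(Settings())
crawler.crawl(BBSpider, start_url=url)  # keyword arguments are forwarded to BBSpider.__init__
crawler.start()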
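The double print and the None in the question are consistent with the spider being constructed twice: once by the script with the URL, and once more by the crawl machinery without it. The question's code is not reproduced here, so this is an assumption about what it did. The sketch below is a toy model in plain Python, not Scrapy's actual implementation: ToySpider and ToyProcess are hypothetical stand-ins, and only the from_crawler classmethod name is borrowed from Scrapy's Spider class.

```python
class ToySpider:
    """Hypothetical stand-in for a spider; this is not scrapy.Spider."""

    name = "dmoz"

    def __init__(self, *args, **kwargs):
        # Same pattern as in the question: read the URL from keyword arguments.
        self.start_urls = [kwargs.get("start_url")]
        print(self.start_urls[0])  # mirrors the question's single print statement

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Builds a fresh spider from whatever arguments the runner was given.
        return cls(*args, **kwargs)


class ToyProcess:
    """Hypothetical, simplified runner that always builds the spider itself."""

    def crawl(self, spidercls, *args, **kwargs):
        # A classmethod can be looked up on an instance as well as on a class,
        # so handing in an instance still creates a brand-new spider here.
        return spidercls.from_crawler(self, *args, **kwargs)


url = "http://bigbasket.com/ps/?q=apple"
process = ToyProcess()

# Assumed original usage: construct the spider first, then hand over the instance.
# __init__ runs twice; the second run receives no keyword arguments.
lost = process.crawl(ToySpider(start_url=url))

# Usage from the answer: hand over the class and let the runner forward the argument.
kept = process.crawl(ToySpider, start_url=url)

print(lost.start_urls, kept.start_urls)
```

In this model the first call prints the URL and then None, matching the pattern in the question's logs, while the second call prints the URL once. It would also account for the second traceback: wrapping the missing value in str() turns None into the text 'None', which is a string but not a URL with a scheme.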
