从Python的脚本传递参数给Scrapy蜘蛛 [英] Passing Argument to Scrapy Spider from Python Script
问题描述
我只提一些我已经(发布这个问题之前我目前还没有链接到所有的,我提到了这些问题,)发布这个问题之前所提问题 - :
I am mentioning only SOME of the questions that I have referred before posting this question (I currently don't have links to all of those questions that I had referred to, before posting this question)-:
- Question 1
- Question 2
我能够完全运行这个code,如果我不传递参数,并要求从BBSpider类(用户的输入,而不主要功能 - 乌斯名=DMOZ的线下),或者为他们提供的pre定义(即静态的)参数。
I am able to run this code completely, if I don't pass the arguments and ask for an input from the user from the BBSpider Class (without the main function - ust below the name="dmoz" line), or provide them as pre-defined (i.e, static) arguments.
我的code是这里。
我基本上试图执行从一个Python脚本一个Scrapy蜘蛛无任何附加文件的要求(即使设置文件)。这就是为什么,我已经指定了$ C $内还设置C本身。
I am basically trying to execute a Scrapy spider from a Python Script without the requirement of any additional files (even the Settings File). That is why, I have specified the settings also inside the code itself.
这是我上执行这个脚本得到的输出 -
This is the output that I am getting on executing this script-:
http://bigbasket.com/ps/?q=apple
2015-06-26 12:12:34 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:12:34 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:12:34 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:12:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:12:35 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:12:35 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:12:35 [scrapy] INFO: Enabled item pipelines:
2015-06-26 12:12:35 [scrapy] INFO: Spider opened
2015-06-26 12:12:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:12:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:12:35 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
request = next(slot.start_requests)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
yield self.make_requests_from_url(url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
return Request(url, dont_filter=True)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 57, in _set_url
raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2015-06-26 12:12:35 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:12:35 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 342543),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'start_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 339158)}
2015-06-26 12:12:35 [scrapy] INFO: Spider closed (finished)
这是我目前所面临的问题 - :
The problems that I am currently facing-:
- 如果你仔细看,1号线和我输出的6代线,我传递给我的蜘蛛得到了两次打印的START_URL,即使我已经在我的code的第31行(其链接书面打印语句一次我上面了)。这是为什么发生,太值不同(1号线(我的输出的初始打印语句输出)给出正确的结果,但对6号线(我的输出)?print语句输出不仅如此,即使我写 - 打印'HI' - 那也是它被打印两次这究竟是为什么 ?
- 接下来,如果你看到这条线我的输出 - 的: 类型错误:请求的URL必须是海峡或单向code,得到NoneType:这是为什么来了(尽管那我张贴以上问题的环节,都写同样的事情)?我不知道如何解决它?我甚至尝试`self.start_urls = [STR(kwargs.get('START_URL'))]` - 那么它提供了以下输出 - :
http://bigbasket.com/ps/?q=apple
2015-06-26 12:28:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:28:00 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:28:00 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:28:00 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:28:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:28:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:28:01 [scrapy] INFO: Enabled item pipelines:
2015-06-26 12:28:01 [scrapy] INFO: Spider opened
2015-06-26 12:28:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:28:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:28:01 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
request = next(slot.start_requests)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
yield self.make_requests_from_url(url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
return Request(url, dont_filter=True)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: None
2015-06-26 12:28:01 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:28:01 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 248350),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'start_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 236056)}
2015-06-26 12:28:01 [scrapy] INFO: Spider closed (finished)
请帮我解决上述2个错误。
Please help me resolve the above 2 errors.
推荐答案
您需要通过对您的参数抓取
的方法 CrawlerProcess
,所以你需要这样运行的:
You need to pass your parameters on the crawl
method of the CrawlerProcess
, so you need to run it like this:
crawler = CrawlerProcess(Settings())
crawler.crawl(BBSpider, start_url=url)
crawler.start()
这篇关于从Python的脚本传递参数给Scrapy蜘蛛的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!