Passing Argument to Scrapy Spider from Python Script

Views: 5979 Published: 2016/5/29 12:17:02 Tags: python web-scraping arguments scrapy scrapy-spider

Problem description

These are only some of the questions I consulted before posting this one (I no longer have links to all of them):

Question 1
Question 2

I can run this code successfully if I don't pass the arguments and instead ask for user input inside the BBSpider class (without the main function, just below the name="dmoz" line), or if I provide them as predefined (i.e., static) arguments.

My code is here.

I am trying to execute a Scrapy spider from a Python script without requiring any additional files (not even the settings file). That is why I have specified the settings inside the code itself.

This is the output I get when executing the script:

http://bigbasket.com/ps/?q=apple
2015-06-26 12:12:34 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:12:34 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:12:34 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:12:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:12:35 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:12:35 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:12:35 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:12:35 [scrapy] INFO: Spider opened
2015-06-26 12:12:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:12:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:12:35 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2015-06-26 12:12:35 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:12:35 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 342543),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 339158)}
2015-06-26 12:12:35 [scrapy] INFO: Spider closed (finished)

The problems I am currently facing:

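First, if you look carefully at line 1 and line 6 of my output, the start_url that I passed to my spider is printed twice, even though I have written the print statement only once, on line 31 of my code (linked above). Why does that happen, and with different values too? The first print (line 1 of my output) gives the correct result, but the second (line 6 of my output) prints None. Moreover, even if I write print 'hi', it is also printed twice. Why is this happening?
Next, look at this line of my output: TypeError: Request url must be str or unicode, got NoneType: Why does this occur, even though the questions I linked above do the same thing? I have no idea how to resolve it. I even tried `self.start_urls = [str(kwargs.get('start_url'))]`, and then it gives the following output: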

http://bigbasket.com/ps/?q=apple
2015-06-26 12:28:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:28:00 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:28:00 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:28:00 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:28:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:28:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:28:01 [scrapy] INFO: Enabled item pipelines: 
2015-06-26 12:28:01 [scrapy] INFO: Spider opened
2015-06-26 12:28:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:28:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:28:01 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
    request = next(slot.start_requests)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
    yield self.make_requests_from_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: None
2015-06-26 12:28:01 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:28:01 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 248350),
 'log_count/DEBUG': 1,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 236056)}
2015-06-26 12:28:01 [scrapy] INFO: Spider closed (finished)
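
The change from TypeError to ValueError is consistent with `start_url` still being absent from `kwargs`: wrapping the lookup in `str()` only turns the missing value into the text "None", which is not a URL. A small standard-library sketch of that effect (the empty `kwargs` dict is an assumption standing in for a spider created without arguments):

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7, as in the traceback

kwargs = {}  # assumption: start_url was never passed to this instance

url = str(kwargs.get("start_url"))
print(repr(url))                   # 'None'
print(repr(urlparse(url).scheme))  # ''
```

So the `str()` call hides the missing argument rather than supplying it; the string has no scheme, which matches the "Missing scheme in request url: None" message.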

Please help me resolve the above two errors.
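
One pattern that would produce both symptoms, sketched in plain Python under the assumption that the script builds a spider instance itself and the runner then creates its own instance from the class without the original arguments. `DemoSpider` and `crawl` are hypothetical stand-ins, not Scrapy's classes or functions:

```python
class DemoSpider(object):
    """Hypothetical stand-in for a spider; not a Scrapy class."""

    def __init__(self, **kwargs):
        self.start_urls = [kwargs.get("start_url")]
        print(self.start_urls[0])  # runs once per instance created


def crawl(spider_or_cls, **kwargs):
    """Hypothetical runner that always builds a fresh instance from the class."""
    if isinstance(spider_or_cls, type):
        cls = spider_or_cls
    else:
        cls = type(spider_or_cls)  # the instance's own arguments are dropped
    return cls(**kwargs)


# Handing over a ready-made instance: the print runs twice, second time None.
first = DemoSpider(start_url="http://bigbasket.com/ps/?q=apple")
second = crawl(first)

# Handing over the class plus keyword arguments keeps the value.
third = crawl(DemoSpider, start_url="http://bigbasket.com/ps/?q=apple")
```

If this is what happens in the script, the second, argument-less instance is the one that is crawled, which would account for both the duplicated print and the None URL.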
