Pass an argument to a Scrapy spider from within a Python script
Question
I can run a crawl from a Python script with the following recipe from the wiki:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()
As you can see, I can pass domain to FollowAllSpider, but my question is: how can I pass start_urls (actually a date that will be appended to a fixed URL) to my spider class using the code above?
This is my spider class:
class MySpider(CrawlSpider):
    name = 'tw'
    def __init__(self,date):
        y,m,d=date.split('-') #this is a test , it could split with regex!
        try:
            y,m,d=int(y),int(m),int(d)
        except ValueError:
            raise 'Enter a valid date'
        self.allowed_domains = ['mydomin.com']
        self.start_urls = ['my_start_urls{}-{}-{}'.format(y,m,d)]
    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="result-link"]/span/a/@href')
        for question in questions:
            item = PoptopItem()
            item['url'] = question.extract()
            yield item['url']
And this is my script:
from pdfcreator import convertor
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
#from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
from poptop.spiders.stackoverflow_spider import MySpider
from poptop.items import PoptopItem
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
date=raw_input('Enter the date with this format (d-m-Y) : ')
print date
spider=MySpider(date=date)
crawler.crawl(spider)
crawler.start()
log.start()
item=PoptopItem()
for url in item['url']:
    convertor(url)
reactor.run() # the script will block here until the spider_closed signal was sent
If I just print the item, I get the following error:
2015-02-25 17:13:47+0330 [tw] ERROR: Spider must return Request, BaseItem or None, got 'unicode' in <GET test-link2015-1-17>
Items:
import scrapy
class PoptopItem(scrapy.Item):
    titles = scrapy.Field()
    content = scrapy.Field()
    url = scrapy.Field()
Recommended answer
You need to modify your __init__() constructor to accept the date argument. Also, I would use datetime.strptime() to parse the date string:
from datetime import datetime

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        date = kwargs.get('date')
        if not date:
            raise ValueError('No date given')
        dt = datetime.strptime(date, "%m-%d-%Y")
        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]
Then, you would instantiate the spider this way:
spider = MySpider(date='01-01-2015')
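The date handling in the constructor above can be tried on its own, without Scrapy. The following is only a sketch: build_start_url and build_padded_start_url are hypothetical helper names, and http://test.com/ is the placeholder host from the answer's example. It also shows that {dt.month} and {dt.day} are not zero-padded, which matters if the target site expects 2015-01-17 rather than 2015-1-17.

```python
from datetime import datetime

BASE_URL = 'http://test.com/'  # placeholder host taken from the answer's example


def build_start_url(date_string, date_format='%m-%d-%Y'):
    """Parse date_string and return the URL for that day.

    Hypothetical helper, not part of Scrapy. datetime.strptime raises
    ValueError when the string does not match date_format.
    """
    dt = datetime.strptime(date_string, date_format)
    # dt.month and dt.day are plain ints, so they are not zero-padded here.
    return '{base}{dt.year}-{dt.month}-{dt.day}'.format(base=BASE_URL, dt=dt)


def build_padded_start_url(date_string, date_format='%m-%d-%Y'):
    """Same as build_start_url, but with zero-padded month and day."""
    dt = datetime.strptime(date_string, date_format)
    # A format spec on a datetime is passed to strftime.
    return '{base}{dt:%Y-%m-%d}'.format(base=BASE_URL, dt=dt)


if __name__ == '__main__':
    print(build_start_url('01-17-2015'))
    print(build_padded_start_url('01-17-2015'))
    # The question's prompt asks for day-month-year, so the format must match:
    print(build_start_url('17-01-2015', date_format='%d-%m-%Y'))
```

Note that the script in the question prompts for the date as d-m-Y, while the answer parses "%m-%d-%Y"; whichever order is chosen, the prompt and the format string need to agree, otherwise strptime raises ValueError or silently swaps day and month.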
Or, you can even avoid parsing the date at all by passing a datetime instance in the first place:
class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']
    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        dt = kwargs.get('dt')
        if not dt:
            raise ValueError('No date given')
        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]

# Integer literals with a leading zero (month=01) are a syntax error in Python 3.
spider = MySpider(dt=datetime(year=2014, month=1, day=1))
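The keyword-argument pattern used in the constructor above can be exercised without Scrapy installed. In this sketch, StandInSpider is a hypothetical stand-in written for illustration (it is not Scrapy's base class); it only accepts keyword arguments and stores them as attributes, which is enough to check that the subclass builds start_urls from dt and rejects a missing date.

```python
from datetime import datetime


class StandInSpider(object):
    """Hypothetical stand-in for a spider base class (not Scrapy's own)."""

    def __init__(self, *args, **kwargs):
        # Keep the keyword arguments as attributes so they can be inspected.
        self.__dict__.update(kwargs)


class MySpider(StandInSpider):
    name = 'tw'
    allowed_domains = ['test.com']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        dt = kwargs.get('dt')
        if not dt:
            raise ValueError('No date given')
        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]


if __name__ == '__main__':
    spider = MySpider(dt=datetime(year=2014, month=1, day=1))
    print(spider.start_urls)
```

Reading the value from kwargs rather than declaring a positional date parameter keeps the constructor signature compatible with whatever the base class expects, at the cost of having to validate the argument by hand.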
And, just FYI, see this answer for a detailed example of how to run Scrapy from a script.