Passing arguments to scrapy
Problem description
I followed the advice from these two posts, as I am also trying to create a generic scrapy spider:
How to pass a user defined argument in scrapy spider (http://stackoverflow.com/questions/15611605/how-to-pass-a-user-defined-argument-in-scrapy-spider)
Creating a generic scrapy spider
But I'm getting an error saying that the variable I am supposed to be passing as an argument is not defined. Am I missing something in my __init__ method?
Code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from data.items import DataItem

class companySpider(BaseSpider):
    name = "woz"

    def __init__(self, domains=""):
        '''
        domains is a string
        '''
        self.domains = domains

    deny_domains = [""]
    start_urls = [domains]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('/html')
        items = []
        for site in sites:
            item = DataItem()
            item['text'] = site.select('text()').extract()
            items.append(item)
        return items
Here is my command-line:
scrapy crawl woz -a domains="http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
And here is the error:
NameError: name 'domains' is not defined
You should call super(companySpider, self).__init__(*args, **kwargs) at the beginning of your __init__.

def __init__(self, domains="", *args, **kwargs):
    super(companySpider, self).__init__(*args, **kwargs)
    self.domains = domains
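To see why the error appears and how the __init__ pattern avoids it, here is a minimal sketch that runs without scrapy installed. FakeBaseSpider is a hypothetical, simplified stand-in for the real base class (it only copies extra keyword arguments onto the instance), and CompanySpider and define_broken_spider are illustrative names, not part of the original code. The likely cause of the NameError is that a class-level line such as start_urls = [domains] is evaluated when the class is defined, where no name called domains exists.

```python
class FakeBaseSpider(object):
    """Hypothetical stand-in for scrapy's base spider (not the real class)."""
    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # Store any extra keyword arguments as instance attributes.
        self.__dict__.update(kwargs)


class CompanySpider(FakeBaseSpider):
    name = "woz"

    def __init__(self, domains="", *args, **kwargs):
        # Let the base class process the remaining arguments first.
        super(CompanySpider, self).__init__(*args, **kwargs)
        self.domains = domains
        # Build start_urls here, where `domains` is actually in scope.
        self.start_urls = [domains] if domains else []


def define_broken_spider():
    """Defining this class raises NameError, as in the question."""
    class BrokenSpider(FakeBaseSpider):
        name = "woz"
        # The class body runs at definition time; `domains` is not a name
        # that exists here, so Python raises NameError.
        start_urls = [domains]
    return BrokenSpider
```

Setting start_urls inside __init__ is one design choice; the start_requests() approach described next avoids overriding __init__ altogether.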
In your case, where your first requests depend on a spider argument, what I usually do is override only the start_requests() method, without overriding __init__(). The argument name from the command line is already available as an attribute on the spider:
from scrapy.http import Request
from scrapy.spider import BaseSpider

class companySpider(BaseSpider):
    name = "woz"
    deny_domains = [""]

    def start_requests(self):
        yield Request(self.domains)  # for example if domains is a single URL

    def parse(self, response):
        ...
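Spider arguments passed with -a arrive as plain strings, so if domains were to carry several URLs, the spider would have to split the string itself. The sketch below assumes a comma-separated format, which is an assumption and not something the question specifies. split_urls and StartRequestsSketch are hypothetical names, and the class is a scrapy-free stand-in that only shows the iteration logic.

```python
def split_urls(domains):
    """Split a comma-separated argument string into a list of clean URLs."""
    return [url.strip() for url in (domains or "").split(",") if url.strip()]


class StartRequestsSketch(object):
    """Hypothetical stand-in for the spider; it does not use scrapy."""

    def __init__(self, domains=""):
        self.domains = domains

    def start_urls_from_argument(self):
        # In a real spider this loop would live in start_requests() and
        # yield Request(url) instead of the bare URL string.
        for url in split_urls(self.domains):
            yield url
```

With a single URL, as in the command line above, the loop yields just that one URL, so the same code covers both cases.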