Scrapy - Crawl Multiple Pages Per Item

Question
I am trying to crawl a few extra pages per item to grab some location information.
At the end of the item, before returning, I check whether we need to crawl extra pages to grab the information. Essentially these pages contain some location details and are a simple GET request, e.g.

http://site.com.au/MVC/Offer/GetLocationDetails/?locationId=3761&companyId=206
The above link returns either a select with more pages to crawl, or a dd/dt with the address details. Either way I need to extract this address info and append it to my item['locations'].
So far I have (at the end of the parse block):

```python
return self.fetchLocations(locations_selector, company_id, item)
```
locations_selector contains a list of locationIds.
Then I have:

```python
def fetchLocations(self, locations, company_id, item): #response):
    for location in locations:
        if len(location) > 1:
            yield Request(
                "http://site.com.au/MVC/Offer/GetLocationDetails/?locationId="
                + location + "&companyId=" + company_id,
                callback=self.parseLocation,
                meta={'company_id': company_id, 'item': item})
```
And finally:

```python
def parseLocation(self, response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    dl = hxs.select("//dl")
    if len(dl) > 0:
        address = hxs.select("//dl[1]/dd").extract()
        loc = {'address': remove_entities(replace_escape_chars(
            replace_tags(address[0], token=' '), replace_by=''))}
        yield loc
    locations_select = hxs.select("//select/option/@value").extract()
    if len(locations_select) > 0:
        yield self.fetchLocations(locations_select, response.meta['company_id'], item)
```
Can't seem to get this working....
Answer

Here is your code:
```python
def parseLocation(self, response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    dl = hxs.select("//dl")
    if len(dl) > 0:
        address = hxs.select("//dl[1]/dd").extract()
        loc = {'address': remove_entities(replace_escape_chars(
            replace_tags(address[0], token=' '), replace_by=''))}
        yield loc
    locations_select = hxs.select("//select/option/@value").extract()
    if len(locations_select) > 0:
        yield self.fetchLocations(locations_select, response.meta['company_id'], item)
```
Callbacks must return either requests to other pages, or items. In the code above I see requests yielded, but no items. You have `yield loc`, but `loc` is a `dict`, not an `Item` subclass.