Scrapy - Crawl Multiple Pages Per Item


Question

I am trying to crawl a few extra pages per item to grab some location information.

At the end of the item, before returning it, I check whether we need to crawl extra pages to grab the information; essentially these pages contain some location details and are fetched with a simple GET request.

http://site.com.au/MVC/Offer/GetLocationDetails/?locationId=3761&companyId=206

The above link returns either a select with more pages to crawl, or a dd/dt list with the address details. Either way I need to extract this address info and append it to my item['locations'].

So far I have (at the end of the parse block):

return self.fetchLocations(locations_selector, company_id, item)

locations_selector contains a list of locationIds.

Then I have:

def fetchLocations(self, locations, company_id, item): #response):
    for location in locations:
        if len(location)>1:
            yield Request("http://site.com.au/MVC/Offer/GetLocationDetails/?locationId="+location+"&companyId="+company_id,
                callback=self.parseLocation,
                meta={'company_id': company_id, 'item': item})

And finally:

def parseLocation(self,response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']

    dl = hxs.select("//dl")
    if len(dl)>0:
        address = hxs.select("//dl[1]/dd").extract()
        loc = {'address':remove_entities(replace_escape_chars(replace_tags(address[0], token=' '), replace_by=''))}
        yield loc

    locations_select = hxs.select("//select/option/@value").extract()
    if len(locations_select)>0:
        yield self.fetchLocations(locations_select, response.meta['company_id'], item)

Can't seem to get this working....

Answer

Here is your code:

def parseLocation(self,response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']

    dl = hxs.select("//dl")
    if len(dl)>0:
        address = hxs.select("//dl[1]/dd").extract()
        loc = {'address':remove_entities(replace_escape_chars(replace_tags(address[0], token=' '), replace_by=''))}
        yield loc

    locations_select = hxs.select("//select/option/@value").extract()
    if len(locations_select)>0:
        yield self.fetchLocations(locations_select, response.meta['company_id'], item)

Callbacks must return either requests to other pages, or items. In the code above I see requests being yielded, but not items. You have yield loc, but loc is a dict, not an Item subclass.
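Below is a minimal sketch of one way to apply this. The LocationItem class and its address/company_id fields are hypothetical, not part of the original spider, and the old-style HtmlXPathSelector API from the question is kept as-is. The sketch also yields each individual Request produced by fetchLocations, rather than the generator object itself, so that every request actually reaches the scheduler.

from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector

class LocationItem(Item):
    # hypothetical item holding one address per crawled location page
    address = Field()
    company_id = Field()

# inside the spider class:
def parseLocation(self, response):
    hxs = HtmlXPathSelector(response)
    item = response.meta['item']
    company_id = response.meta['company_id']

    # address page: yield an Item subclass instead of a plain dict
    dl = hxs.select("//dl")
    if len(dl) > 0:
        loc = LocationItem()
        loc['address'] = hxs.select("//dl[1]/dd").extract()[0]  # strip tags/entities here as before
        loc['company_id'] = company_id
        yield loc

    # select page: queue the extra location pages, re-yielding each
    # Request from fetchLocations rather than the generator itself
    locations_select = hxs.select("//select/option/@value").extract()
    if len(locations_select) > 0:
        for request in self.fetchLocations(locations_select, company_id, item):
            yield request

Alternatively, you could append the address to item['locations'] and yield the item itself, but then you would need to decide at which point all the location pages for that item have been collected.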
