Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback


Problem Description

Basically, the code below scrapes the first 5 items of a table. One of the fields is another href, and clicking on that href provides more info which I want to collect and add to the original item. So parse is supposed to pass the semi-populated item to parse_next_page, which then scrapes the next bit and should return the completed item back to parse.

Running the code below only returns the info collected in parse. If I change return items to return request, I get a completed item with all 3 "things", but I only get 1 of the rows, not all 5. I'm sure it's something simple; I just can't see it.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from myproject.items import ScrapyItem  # hypothetical: wherever your Item class lives


class ThingSpider(BaseSpider):
    name = "thing"
    allowed_domains = ["somepage.com"]
    start_urls = [
        "http://www.somepage.com"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []

        for x in range(1, 6):
            item = ScrapyItem()
            str_selector = '//tr[@name="row{0}"]'.format(x)
            item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
            item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
            print 'hello'
            request = Request("http://www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
            print 'hello2'
            request.meta['item'] = item
            items.append(item)

        return items

    def parse_next_page(self, response):
        print 'stuff'
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
        return item

Recommended Answer

Install pyOpenSSL; sometimes Fiddler also causes problems for "https://*" requests. Close Fiddler if it is running and run the spider again. The other problem in your code is that you build the requests in the parse method but never use 'yield' to hand them to the Scrapy scheduler. You should do it like this:

def parse(self, response):
    hxs = HtmlXPathSelector(response)

    for x in range(1, 6):
        item = ScrapyItem()
        str_selector = '//tr[@name="row{0}"]'.format(x)
        item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
        item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
        print 'hello'
        request = Request("http://www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
        if request:
            yield request
        else:
            yield item
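
For completeness, here is a minimal sketch of the callback side of this pattern, using the same placeholder URL and selectors as the question: parse_next_page pulls the semi-populated item out of response.meta, fills in the last field, and returns it, so the engine receives one finished item per row.

def parse_next_page(self, response):
    hxs = HtmlXPathSelector(response)
    # Recover the semi-populated item that parse() attached to the request
    item = response.meta['item']
    # Fill in the last field from the detail page, then hand the
    # completed item to the engine (and on to the item pipelines)
    item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
    return item

With this pattern, each of the 5 rows produces its own Request, each Request carries its own item in meta, and each callback returns one completed item, so all 5 rows come out with all 3 fields instead of only the ones collected in parse.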

