Scrapy Not Returning Additional Info from Scraped Link in Item via Request Callback
Question
Basically the code below scrapes the first 5 items of a table. One of the fields is another href, and clicking on that href provides more info which I want to collect and add to the original item. So parse is supposed to pass the semi-populated item to parse_next_page, which then scrapes the next bit and should return the completed item back to parse.
Running the code below only returns the info collected in parse. If I change the return items to return request, I get a completed item with all 3 "things", but I only get 1 of the rows, not all 5. I'm sure it's something simple, I just can't see it.
class ThingSpider(BaseSpider):
    name = "thing"
    allowed_domains = ["somepage.com"]
    start_urls = [
        "http://www.somepage.com"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for x in range(1, 6):
            item = ScrapyItem()
            str_selector = '//tr[@name="row{0}"]'.format(x)
            item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
            item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
            print 'hello'
            request = Request("www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
            print 'hello2'
            request.meta['item'] = item
            items.append(item)
        return items

    def parse_next_page(self, response):
        print 'stuff'
        hxs = HtmlXPathSelector(response)
        item = response.meta['item']
        item['thing3'] = hxs.select('//div/ul/li[1]/span[2]/text()').extract()
        return item
Answer
Install pyOpenSSL; sometimes Fiddler also causes problems for "https://*" requests, so close Fiddler if it is running and run the spider again. The other problem is in your code: parse builds Request objects but never uses 'yield' to hand them back to the Scrapy scheduler, so the callbacks are never invoked. You should do it like this:
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    for x in range(1, 6):
        item = ScrapyItem()
        str_selector = '//tr[@name="row{0}"]'.format(x)
        item['thing1'] = hxs.select(str_selector + '/a/text()').extract()
        item['thing2'] = hxs.select(str_selector + '/a/@href').extract()
        print 'hello'
        request = Request("www.nextpage.com", callback=self.parse_next_page, meta={'item': item})
        if request:
            yield request
        else:
            yield item
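The mechanics behind this fix can be illustrated without Scrapy at all: the engine iterates over whatever a callback yields, schedules anything that is a Request, and collects everything else as a scraped item. The sketch below uses an illustrative FakeRequest class and fake_engine loop as stand-ins for scrapy.Request and the real engine, which behave analogously; the names and the 'detail' value are assumptions, not Scrapy APIs.

```python
# Minimal, Scrapy-free sketch of the yield/callback flow described above.
# FakeRequest and fake_engine are illustrative stand-ins for
# scrapy.Request and the Scrapy engine.

class FakeRequest(object):
    def __init__(self, url, callback=None, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}

def parse(response):
    # Yield one request per table row; each carries its own
    # half-filled item in meta, as the answer recommends.
    for x in range(1, 6):
        item = {'thing1': 'text%d' % x, 'thing2': 'href%d' % x}
        yield FakeRequest('http://www.nextpage.com',
                          callback=parse_next_page,
                          meta={'item': item})

def parse_next_page(response, request):
    # Pull the half-filled item back out of meta and complete it.
    item = request.meta['item']
    item['thing3'] = 'detail'  # stands in for the XPath extract
    yield item

def fake_engine(start_callback):
    # Drive the callbacks the way the engine would: schedule yielded
    # requests, collect everything else as finished items.
    items = []
    pending = list(start_callback(None))
    while pending:
        obj = pending.pop(0)
        if isinstance(obj, FakeRequest):
            pending.extend(obj.callback(None, obj))
        else:
            items.append(obj)
    return items

scraped = fake_engine(parse)
assert len(scraped) == 5                          # all five rows arrive
assert all('thing3' in it for it in scraped)      # each item is completed
```

Because parse yields a request per loop iteration, every row gets its own trip through parse_next_page, which is why the fix produces all 5 completed items instead of 1.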