Scrapy: populate items with item loaders over multiple pages

Question

I'm trying to crawl and scrape multiple pages, given multiple URLs. I am testing with Wikipedia, and to make it easier I just used the same XPath selector for each page, but I eventually want to use many different XPath selectors unique to each page, so each page has its own separate parsePage method.

This code works perfectly when I don't use item loaders and just populate the items directly. When I use item loaders, the items are populated strangely, and it seems to completely ignore the callbacks assigned in the parse method, using only the start_urls for the parsePage methods.

import scrapy
from scrapy import Request
from testanother.items import TestItems, TheLoader

class tester(scrapy.Spider):
    name = 'vs'
    handle_httpstatus_list = [404, 200, 300]
    # Usually, I only get data from the first start url
    start_urls = ['https://en.wikipedia.org/wiki/SANZAAR',
                  'https://en.wikipedia.org/wiki/2016_Rugby_Championship',
                  'https://en.wikipedia.org/wiki/2016_Super_Rugby_season']

    def parse(self, response):
        #item = TestItems()
        l = TheLoader(item=TestItems(), response=response)
        # when I use an item loader, the url in the request is completely ignored.
        # without the item loader, it works properly.
        request = Request("https://en.wikipedia.org/wiki/2016_Rugby_Championship", callback=self.parsePage1, meta={'loadernext': l}, dont_filter=True)
        yield request

        request = Request("https://en.wikipedia.org/wiki/SANZAAR", callback=self.parsePage2, meta={'loadernext1': l}, dont_filter=True)
        yield request

        yield Request("https://en.wikipedia.org/wiki/2016_Super_Rugby_season", callback=self.parsePage3, meta={'loadernext2': l}, dont_filter=True)

    def parsePage1(self, response):
        loadernext = response.meta['loadernext']
        loadernext.add_xpath('title1', '//*[@id="firstHeading"]/text()')
        return loadernext.load_item()
    # I'm not sure if this return and load_item is the problem, because I've tried
    # yielding/returning to another method that does the item loading instead, and
    # the first start url is still the only url scraped.

    def parsePage2(self, response):
        loadernext1 = response.meta['loadernext1']
        loadernext1.add_xpath('title2', '//*[@id="firstHeading"]/text()')
        return loadernext1.load_item()

    def parsePage3(self, response):
        loadernext2 = response.meta['loadernext2']
        loadernext2.add_xpath('title3', '//*[@id="firstHeading"]/text()')
        return loadernext2.load_item()

Here's the result when I don't use item loaders:

{'title1': [u'2016 Rugby Championship'],
 'title': [u'SANZAAR'],
 'title3': [u'2016 Super Rugby season']}

Here's a bit of the log with item loaders:

{'title2': u'SANZAAR'}
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Rugby_Championship> (referer: https://en.wikipedia.org/wiki/SANZAAR)
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Rugby_Championship> (referer: https://en.wikipedia.org/wiki/2016_Rugby_Championship)
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Super_Rugby_season>
{'title2': u'SANZAAR', 'title3': u'SANZAAR'}
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/SANZAAR> (referer: https://en.wikipedia.org/wiki/2016_Rugby_Championship)
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Rugby_Championship> (referer: https://en.wikipedia.org/wiki/2016_Super_Rugby_season)
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Super_Rugby_season> (referer: https://en.wikipedia.org/wiki/2016_Rugby_Championship)
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/2016_Super_Rugby_season> (referer: https://en.wikipedia.org/wiki/2016_Super_Rugby_season)
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Rugby_Championship>
{'title1': u'SANZAAR', 'title2': u'SANZAAR', 'title3': u'SANZAAR'}
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Rugby_Championship>
{'title1': u'2016 Rugby Championship'}
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/SANZAAR>
{'title1': u'2016 Rugby Championship', 'title2': u'2016 Rugby Championship'}
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Rugby_Championship>
{'title1': u'2016 Super Rugby season'}
2016-09-24 14:30:43 [scrapy] DEBUG: Crawled (200) <GET https://en.wikipedia.org/wiki/SANZAAR> (referer: https://en.wikipedia.org/wiki/2016_Super_Rugby_season)
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Super_Rugby_season>
{'title1': u'2016 Rugby Championship',
 'title2': u'2016 Rugby Championship',
 'title3': u'2016 Rugby Championship'}
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/2016_Super_Rugby_season>
{'title1': u'2016 Super Rugby season', 'title3': u'2016 Super Rugby season'}
2016-09-24 14:30:43 [scrapy] DEBUG: Scraped from <200 https://en.wikipedia.org/wiki/SANZAAR>
{'title1': u'2016 Super Rugby season',
 'title2': u'2016 Super Rugby season',
 'title3': u'2016 Super Rugby season'}
2016-09-24 14:30:43 [scrapy] INFO: Clos

What exactly is going wrong? Thanks!

Answer

One issue is that you're passing multiple references to the same item loader instance into multiple callbacks: parse yields three requests, and each one carries that same loader in its meta.
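
To see why that matters, here is a minimal plain-Python sketch of the aliasing (using a hypothetical stand-in FakeLoader class rather than Scrapy's actual ItemLoader, which behaves the same way with respect to object identity):

class FakeLoader:
    def __init__(self):
        self.item = {}
    def add_value(self, field, value):
        self.item[field] = value

l = FakeLoader()
meta1 = {'loadernext': l}   # passed to parsePage1
meta2 = {'loadernext1': l}  # passed to parsePage2

# Each callback thinks it has "its own" loader, but both names
# point at the same object, so both mutate one shared item:
meta1['loadernext'].add_value('title1', 'from parsePage1')
meta2['loadernext1'].add_value('title2', 'from parsePage2')
print(l.item)  # {'title1': 'from parsePage1', 'title2': 'from parsePage2'}

This is why fields scraped in one callback show up in the items returned by the others.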

Also, in the follow-up callbacks, the loader is still using the old response object, e.g. in parsePage1 the item loader is still operating on the response from parse.

In most cases it is not advisable to pass item loaders to another callback. Instead, you might find it better to pass the item objects directly.

Here's a short (and incomplete) example, edited from your code:

def parse(self, response):
    l = TheLoader(item=TestItems(), response=response)
    request = Request(
        "https://en.wikipedia.org/wiki/2016_Rugby_Championship",
        callback=self.parsePage1,
        meta={'item': l.load_item()},  # pass the item itself, not the loader
        dont_filter=True
    )
    yield request

def parsePage1(self, response):
    # Build a fresh loader around the carried item and *this* response
    loadernext = TheLoader(item=response.meta['item'], response=response)
    loadernext.add_xpath('title1', '//*[@id="firstHeading"]/text()')
    return loadernext.load_item()
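
If you want a single item populated from all three pages, one way to extend the same idea is to chain the requests sequentially instead of yielding them in parallel, rebinding a fresh loader at each hop. This is an untested sketch: it assumes the field-to-page mapping from your question and your project's TheLoader and TestItems, and the start URL handled by parse is assumed to be the SANZAAR page:

def parse(self, response):
    # Scrape the start page's field here, then hand the item off
    l = TheLoader(item=TestItems(), response=response)
    l.add_xpath('title2', '//*[@id="firstHeading"]/text()')
    yield Request(
        "https://en.wikipedia.org/wiki/2016_Rugby_Championship",
        callback=self.parsePage1,
        meta={'item': l.load_item()},
        dont_filter=True,
    )

def parsePage1(self, response):
    # New loader bound to the carried item and the current response
    loader = TheLoader(item=response.meta['item'], response=response)
    loader.add_xpath('title1', '//*[@id="firstHeading"]/text()')
    yield Request(
        "https://en.wikipedia.org/wiki/2016_Super_Rugby_season",
        callback=self.parsePage2,
        meta={'item': loader.load_item()},
        dont_filter=True,
    )

def parsePage2(self, response):
    loader = TheLoader(item=response.meta['item'], response=response)
    loader.add_xpath('title3', '//*[@id="firstHeading"]/text()')
    return loader.load_item()  # one item populated from all three pages

Because only the final callback returns the item, you get exactly one fully populated item per chain, and each add_xpath runs against the response it was meant for.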
