Scrapy - parse a page to extract items - then follow and store item url contents


Problem description

I have a question on how to do this thing in scrapy. I have a spider that crawls for listing pages of items. Every time a listing page is found, with items, there's the parse_item() callback that is called for extracting items data, and yielding items. So far so good, everything works great.

But each item has, among other data, a URL with more details on that item. I want to follow that URL and store the fetched contents of that item's URL in another item field (url_contents).

And I'm not sure how to organize the code to achieve that, since the two kinds of links (listing links and individual item links) are followed differently, with callbacks called at different times, but I have to correlate them in the processing of the same item.

My code so far looks like this:

# Legacy Scrapy (0.x) imports matching the APIs used below
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector


class MySpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/?q=example",
    ]

    rules = (
        # Rule 1: follow pagination links inside the "pagination" div and
        # parse each listing page with parse_item()
        Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='), restrict_xpaths='//div[@class="pagination"]'), callback='parse_item'),
        # Rule 2: match item detail links, but do not follow them further
        Rule(SgmlLinkExtractor(allow=('item\/detail',)), follow=False),
    )


    def parse_item(self, response):
        main_selector = HtmlXPathSelector(response)
        xpath = '//h2[@class="title"]'

        sub_selectors = main_selector.select(xpath)

        for sel in sub_selectors:
            item = ExampleItem()
            l = ExampleLoader(item = item, selector = sel)
            l.add_xpath('title', 'a[@title]/@title')
            ......
            yield l.load_item()

Answer

After some testing and thinking, I found a solution that works for me. The idea is to use only the first rule, which gives you the listings of items, and, very importantly, to add follow=True to that rule.
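With follow=True added and the second rule dropped, the rules from the question would reduce to something like this (a sketch reusing the question's own link-extractor patterns):

rules = (
    # single rule: extract pagination links, parse each listing page with
    # parse_item(), and keep following links found on those pages
    Rule(SgmlLinkExtractor(allow=('example\.com', 'start='), deny=('sort='),
                           restrict_xpaths='//div[@class="pagination"]'),
         callback='parse_item', follow=True),
)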

And in parse_item() you have to yield a request instead of an item, but only after you load the item. The request is for the item's detail URL, and you have to pass the loaded item along to that request's callback. You do your work with the response in that callback, and that's where you yield the item.

So the end of parse_item() will look like this:

itemloaded = l.load_item()

# follow the item's detail url so its contents can be stored in url_contents
# (item_url_xpath is the XPath of the detail link; Request is scrapy.http.Request)
url = sel.select(item_url_xpath).extract()[0]
request = Request(url, callback = lambda r: self.parse_url_contents(r))
request.meta['item'] = itemloaded  # pass the loaded item along to the callback

yield request

And then parse_url_contents() will look like this:

def parse_url_contents(self, response):
    # retrieve the item attached to the request, fill in the fetched page
    # body, and finally yield the completed item
    item = response.request.meta['item']
    item['url_contents'] = response.body
    yield item
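As a side note, the lambda wrapper isn't strictly needed; an equivalent sketch passes the bound method directly and attaches the item through the Request constructor's meta argument:

url = sel.select(item_url_xpath).extract()[0]
yield Request(url, meta={'item': l.load_item()}, callback=self.parse_url_contents)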

If anyone has another (better) approach, let us know.

Stefan
