Following links, Scrapy web crawler framework


Problem description

After several readings of the Scrapy docs, I'm still not catching the difference between using CrawlSpider rules and implementing my own link extraction mechanism in the callback method.

I'm about to write a new web crawler using the latter approach, but only because I had a bad experience in a past project using rules. I'd really like to know exactly what I'm doing and why.

Is anybody familiar with this tool?

Thanks for your help!

Recommended answer

CrawlSpider inherits from BaseSpider. It just adds rules to extract and follow links. If these rules are not flexible enough for you, use BaseSpider:

import urlparse

from scrapy import log
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class USpider(BaseSpider):
    """My spider."""

    name = 'u_spider'  # every spider needs a unique name; this one is illustrative
    start_urls = ['http://www.amazon.com/s/?url=search-alias%3Dapparel&sort=relevance-fs-browse-rank']
    allowed_domains = ['amazon.com']

    def parse(self, response):
        '''Parse main category search page and extract subcategory search link.'''
        self.log('Downloaded category search page.', log.DEBUG)
        if response.meta.get('depth', 0) > 5:
            self.log('Categories depth limit reached (recursive links?). Stopping further following.', log.WARNING)
            return

        hxs = HtmlXPathSelector(response)
        subcategories = hxs.select("//div[@id='refinements']/*[starts-with(.,'Department')]/following-sibling::ul[1]/li/a[span[@class='refinementLink']]/@href").extract()
        for subcategory in subcategories:
            subcategorySearchLink = urlparse.urljoin(response.url, subcategory)
            yield Request(subcategorySearchLink, callback=self.parseSubcategory)

    def parseSubcategory(self, response):
        '''Parse subcategory search page and extract item links.'''
        hxs = HtmlXPathSelector(response)

        for itemLink in hxs.select('//a[@class="title"]/@href').extract():
            itemLink = urlparse.urljoin(response.url, itemLink)
            self.log('Requesting item page: ' + itemLink, log.DEBUG)
            yield Request(itemLink, callback=self.parseItem)

        try:
            nextPageLink = hxs.select("//a[@id='pagnNextLink']/@href").extract()[0]
            nextPageLink = urlparse.urljoin(response.url, nextPageLink)
            self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
            yield Request(nextPageLink, callback=self.parseSubcategory)
        except IndexError:
            # No "next page" link: this subcategory has been fully paged through.
            self.log('Whole category parsed: ' + response.url, log.DEBUG)

    def parseItem(self, response):
        '''Parse item page and extract product info.'''

        hxs = HtmlXPathSelector(response)
        item = UItem()  # UItem is this project's Item subclass, defined elsewhere

        # extractText is a project helper that selects the XPath and returns cleaned text.
        item['brand'] = self.extractText("//div[@class='buying']/span[1]/a[1]", hxs)
        item['title'] = self.extractText("//span[@id='btAsinTitle']", hxs)
        ...
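For comparison, the rules-based CrawlSpider variant the answer refers to could look roughly like the sketch below. It uses the same old-style Scrapy API as the code above; the spider name is made up, and the restrict_xpaths values simply mirror the XPaths used in the callbacks above.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class UCrawlSpider(CrawlSpider):
    """Rules-based sketch: link extraction is declared once instead of coded in callbacks."""

    name = 'u_crawl'  # illustrative name
    start_urls = ['http://www.amazon.com/s/?url=search-alias%3Dapparel&sort=relevance-fs-browse-rank']
    allowed_domains = ['amazon.com']

    rules = (
        # Follow subcategory and pagination links; a Rule without a callback just follows.
        Rule(SgmlLinkExtractor(restrict_xpaths=("//div[@id='refinements']",
                                                "//a[@id='pagnNextLink']"))),
        # Send item pages to parse_item (never name a CrawlSpider callback "parse").
        Rule(SgmlLinkExtractor(restrict_xpaths=("//a[@class='title']",)),
             callback='parse_item'),
    )

    def parse_item(self, response):
        pass  # same extraction logic as parseItem in the BaseSpider version above

Note that the depth check from parse() above has no natural place in the declarative rules; custom logic like that is exactly the kind of flexibility the answer means when it suggests falling back to BaseSpider.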

Even if BaseSpider's start_urls is not flexible enough for you, override the start_requests method.
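A minimal sketch of what that can look like, again with the old-style API; the spider name and the seed file path are made up for illustration:

from scrapy.http import Request
from scrapy.spider import BaseSpider


class UFeedSpider(BaseSpider):
    """Builds its start requests at runtime instead of from a static start_urls list."""

    name = 'u_feed'  # illustrative name
    allowed_domains = ['amazon.com']

    def start_requests(self):
        # Read seed URLs from a file (hypothetical path) and yield a Request for each.
        with open('category_urls.txt') as f:
            for url in f:
                yield Request(url.strip(), callback=self.parse)

    def parse(self, response):
        pass  # extraction logic goes here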
