Why don't my Scrapy CrawlSpider rules work?


Problem description

I've managed to code a very simple crawler with Scrapy, with these given constraints:

  • Store all link information (e.g. anchor text, page title), hence the 2 callbacks
  • Use CrawlSpider to take advantage of rules, hence no BaseSpider

It runs well, except it doesn't implement the rules if I add a callback to the first request!

Here is my code: (works, but not properly, with a live example)

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapySpider.items import SPage
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider4(CrawlSpider):
    name = "spiderSO"
    allowed_domains = ["cumulodata.com"]
    start_urls = ["http://www.cumulodata.com"]
    extractor = SgmlLinkExtractor()

    def parse_start_url(self, response):
        #3
        print('----------manual call of',response)
        self.parse_links(response)
        print('----------manual call done')
        # 1 return Request(self.start_urls[0]) # does not call parse_links(example.com)
        # 2 return Request(self.start_urls[0],callback = self.parse_links) # does not call parse_links(example.com)

    rules = (
        Rule(extractor,callback='parse_links',follow=True),
        )

    def parse_links(self, response):
        hxs = HtmlXPathSelector(response)
        print('----------- manual parsing links of',response.url)
        links = hxs.select('//a')
        for link in links:
            title = link.select('@title')
            url = link.select('@href').extract()[0]
            meta={'title':title,}
            yield Request(url, callback = self.parse_page,meta=meta)

    def parse_page(self, response):
        print('----------- parsing page: ',response.url)
        hxs = HtmlXPathSelector(response)
        item=SPage()
        item['url'] = str(response.request.url)
        item['title']=response.meta['title']
        item['h1']=hxs.select('//h1/text()').extract()
        yield item

I've tried solving this issue in 3 ways:

  • 1: Return a Request with the start url - the rules are not executed
  • 2: Same as above, but with a callback to parse_links - same issue
  • 3: Call parse_links after scraping the start url, by implementing parse_start_url - the function does not get called

Here are the logs:

----------manual call of <200 http://www.cumulodata.com>)

----------manual call done

#No '----------- manual parsing links', so `parse_links` is never called!

Versions

  • Python 2.7.2
  • Scrapy 0.14.4

Answer

Here's a scraper that works perfectly:

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapySpider.items import SPage
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider4(CrawlSpider):
    name = "spiderSO"
    allowed_domains = ["cumulodata.com"]
    start_urls = ["http://www.cumulodata.com/"]

    extractor = SgmlLinkExtractor()

    rules = (
        Rule(extractor,callback='parse_links',follow=True),
        )

    def parse_start_url(self, response):
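        # parse_links is a generator: calling it alone only creates the generator
        # object; list() iterates it to the end so it actually runs for the start page.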
        list(self.parse_links(response))

    def parse_links(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a')
        for link in links:
            title = ''.join(link.select('./@title').extract())
            url = ''.join(link.select('./@href').extract())
            meta={'title':title,}
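            # Normalize 'www.cumulodata.com' vs 'www.cumulodata.com/' to one form and
            # append '?1' so Scrapy's duplicate filter doesn't drop the request.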
            cleaned_url = "%s/?1" % url if not '/' in url.partition('//')[2] else "%s?1" % url
            yield Request(cleaned_url, callback = self.parse_page, meta=meta,)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        item=SPage()
        item['url'] = response.url
        item['title']=response.meta['title']
        item['h1']=hxs.select('//h1/text()').extract()
        return item

Changes:

  1. Implemented parse_start_url - Unfortunately, when you specify a callback for the first request, the rules are not executed. This is built into Scrapy, and we can only work around it. So we do a list(self.parse_links(response)) inside this function. Why the list()? Because parse_links is a generator, and generators are lazy, so we need to consume it explicitly - see the short sketch right after this item.
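
A minimal standalone sketch of that point (plain Python, no Scrapy; the function name is made up for illustration) - a generator's body does not run until something iterates over it, which is exactly what list() forces:

def make_requests():
    print('generator body running')   # never printed if the generator is only created
    yield 'request-1'
    yield 'request-2'

gen = make_requests()    # nothing is printed yet - generators are lazy
items = list(gen)        # list() iterates to the end, so the body actually runs
print(items)             # ['request-1', 'request-2']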

  2. cleaned_url = "%s/?1" % url if not '/' in url.partition('//')[2] else "%s?1" % url - There are a couple of things going on here:

  a. We're adding '/?1' to the end of the URL - since parse_links returns duplicate URLs, Scrapy filters them out. An easier way to avoid that is to pass dont_filter=True to Request(). However, all your pages are interlinked (back to the index from pageAA, etc.), and dont_filter here results in too many duplicate requests & items.

  b. if not '/' in url.partition('//')[2] - Again, this is because of the linking in your website. One of the internal links points to 'www.cumulodata.com' and another to 'www.cumulodata.com/'. Since we're explicitly adding a mechanism that allows duplicates, this was resulting in one extra item. Since we needed it to be perfect, I implemented this hack; the expression is walked through in the sketch below.
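
A minimal standalone sketch of what that expression does, using the two URL variants mentioned above (plain Python, no Scrapy):

urls = ["http://www.cumulodata.com", "http://www.cumulodata.com/"]

for url in urls:
    netloc_and_path = url.partition('//')[2]   # 'www.cumulodata.com' or 'www.cumulodata.com/'
    if '/' not in netloc_and_path:             # bare domain without a trailing slash
        cleaned_url = "%s/?1" % url            # add the slash and the '?1' marker
    else:
        cleaned_url = "%s?1" % url             # slash already present, just add the marker
    print(cleaned_url)                         # both variants print http://www.cumulodata.com/?1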

  3. title = ''.join(link.select('./@title').extract()) - You don't want to return the node, but the data. Also: ''.join(list) is better than list[0] in case of an empty list (illustrated below).
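
A quick standalone illustration of that last point (plain Python; extract() returns a list of strings, which is empty when the @title attribute is missing):

with_title = ['My page title']    # what extract() returns when @title exists
no_title = []                     # what extract() returns when @title is missing

print(''.join(with_title))        # 'My page title'
print(''.join(no_title))          # '' - safe fallback for the missing attribute
# print(no_title[0])              # would raise IndexError: list index out of range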

Congrats on creating a test website which posed a curious problem - duplicates are both necessary as well as unwanted!
