Scrapy CrawlSpider doesn't crawl the first landing page

Problem description

I am new to Scrapy and am working on a scraping exercise with CrawlSpider. Scrapy follows the relevant links as expected, but I can't get the CrawlSpider to scrape the very first page (the home page / landing page). It goes straight to the links matched by the rule and never scrapes the landing page that contains them. I don't know how to fix this, since overriding the parse method of a CrawlSpider is not recommended. Changing follow=True/False doesn't help either. Here is the code:

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    allowed_domains = ['bnt-chemicals.de']
    start_urls = [
        "http://www.bnt-chemicals.de"        
        ]
    rules = (   
        Rule(SgmlLinkExtractor(allow='prod'), callback='parse_item', follow=True),
        )
    fname = 1

    def parse_item(self, response):
        open(str(self.fname)+ '.txt', 'a').write(response.url)
        open(str(self.fname)+ '.txt', 'a').write(','+ str(response.meta['depth']))
        open(str(self.fname)+ '.txt', 'a').write('\n')
        open(str(self.fname)+ '.txt', 'a').write(response.body)
        open(str(self.fname)+ '.txt', 'a').write('\n')
        self.fname = self.fname + 1

Recommended answer

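Rename the rule's callback to parse_start_url and override that method. CrawlSpider calls parse_start_url for each response from start_urls, so a single method then handles both the landing page and the pages matched by the rule: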

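# Legacy import paths, as in the original answer; newer Scrapy releases
# expose these as scrapy.spiders and scrapy.linkextractors.LinkExtractor.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class DownloadSpider(CrawlSpider):
    name = 'downloader'
    allowed_domains = ['bnt-chemicals.de']
    start_urls = [
        "http://www.bnt-chemicals.de",
    ]
    rules = (
        # Route rule-matched pages to the same method that handles start_urls.
        Rule(SgmlLinkExtractor(allow='prod'), callback='parse_start_url', follow=True),
    )
    fname = 0  # counter used to name the output files

    def parse_start_url(self, response):
        # Called for the landing page and for every page matched by the rule.
        self.fname += 1
        fname = '%s.txt' % self.fname

        with open(fname, 'w') as f:
            # Fall back to depth 0 if 'depth' is missing from response.meta.
            f.write('%s, %s\n' % (response.url, response.meta.get('depth', 0)))
            f.write('%s\n' % response.body)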
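
Why the rename helps can be illustrated with a toy model of the dispatch. The sketch below is plain Python and is not Scrapy's implementation: ToyCrawlSpider, FixedSpider, FakeResponse, and the in-memory SITE map are hypothetical names invented for illustration. It assumes only that the start URL's response is handed to parse_start_url (a no-op unless overridden) and that links matched by the rule are handed to the rule's callback.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> links found on that page.
SITE = {
    "http://example.test/": ["http://example.test/prod/a", "http://example.test/about"],
    "http://example.test/prod/a": ["http://example.test/prod/b"],
    "http://example.test/prod/b": [],
    "http://example.test/about": [],
}


class FakeResponse:
    """Stand-in for a response object: just a URL and its outgoing links."""

    def __init__(self, url):
        self.url = url
        self.links = SITE.get(url, [])


class ToyCrawlSpider:
    """Simplified model of the callback dispatch, not Scrapy's real code."""

    start_urls = ["http://example.test/"]
    allow = "prod"           # pattern a link must match to be followed
    callback = "parse_item"  # method name used for rule-matched pages

    def __init__(self):
        self.scraped = []

    def parse_start_url(self, response):
        # Assumed default: a no-op, so the landing page is not scraped.
        return None

    def parse_item(self, response):
        self.scraped.append(response.url)

    def crawl(self):
        seen = set(self.start_urls)
        # Start URLs are always dispatched to parse_start_url.
        queue = deque((url, "parse_start_url") for url in self.start_urls)
        while queue:
            url, callback_name = queue.popleft()
            response = FakeResponse(url)
            getattr(self, callback_name)(response)
            # Links are extracted from every page (the follow=True case).
            for link in response.links:
                if link not in seen and re.search(self.allow, link):
                    seen.add(link)
                    queue.append((link, self.callback))
        return self.scraped


class FixedSpider(ToyCrawlSpider):
    """Same crawl, with the rule callback pointed at parse_start_url."""

    callback = "parse_start_url"

    def parse_start_url(self, response):
        self.scraped.append(response.url)


before = ToyCrawlSpider().crawl()
after = FixedSpider().crawl()
```

In this model, before contains only the two /prod/ pages, while after also contains the landing page, which mirrors the behaviour described in the question and the effect of the fix.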
