Scrapy Spider not Following Links

Question

I'm writing a scrapy spider to crawl for today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell http://www.nytimes.com, it successfully extracts a list of article urls with le.extract_links(response), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything but the homepage. I'm sort of at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.
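For reference, the shell check described above looks roughly like this (the variable name le is taken from the question; the allow pattern is an assumption mirroring the rule in the spider below):

$ scrapy shell http://www.nytimes.com
>>> from datetime import date
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>> today = date.today().strftime('%Y/%m/%d')
>>> le = LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, ))
>>> links = le.extract_links(response)      # 'response' is provided by the shell
>>> [link.url for link in links][:5]        # a handful of today's article URLs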

from datetime import date                                                       

import scrapy                                                                   
from scrapy.contrib.spiders import Rule                                         
from scrapy.contrib.linkextractors import LinkExtractor                         


from ..items import NewsArticle                                                 

with open('urls/debug/nyt.txt') as debug_urls:                                  
    debug_urls = debug_urls.readlines()                                         

with open('urls/release/nyt.txt') as release_urls:                              
    release_urls = release_urls.readlines() # ["http://www.nytimes.com"]                                 

today = date.today().strftime('%Y/%m/%d')                                       
print today                                                                     


class NytSpider(scrapy.Spider):                                                 
    name = "nyt"                                                                
    allowed_domains = ["nytimes.com"]                                           
    start_urls = release_urls                                                      
    rules = (                                                                      
            Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),          
                 callback='parse', follow=True),                                   
    )                                                                              

    def parse(self, response):                                                     
        article = NewsArticle()                                                                         
        for story in response.xpath('//article[@id="story"]'):                     
            article['url'] = response.url                                          
            article['title'] = story.xpath(                                        
                    '//h1[@id="story-heading"]/text()').extract()                  
            article['author'] = story.xpath(                                       
                    '//span[@class="byline-author"]/@data-byline-name'             
            ).extract()                                                         
            article['published'] = story.xpath(                                 
                    '//time[@class="dateline"]/@datetime').extract()            
            article['content'] = story.xpath(                                   
                    '//div[@id="story-body"]/p//text()').extract()              
            yield article  

Answer

I have found the solution to my problem. I was doing 2 things wrong:

  1. I needed to subclass CrawlSpider rather than Spider if I wanted it to automatically crawl sublinks.
  2. When using CrawlSpider, I needed to use a callback function rather than overriding parse. As per the docs, overriding parse breaks CrawlSpider functionality (both fixes are applied in the sketch below).
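
A minimal sketch of the spider with both fixes applied (the callback name parse_article and the hard-coded start URL are assumptions; the link-extractor rule and XPaths are taken from the question's code):

from datetime import date

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle

today = date.today().strftime('%Y/%m/%d')


class NytSpider(CrawlSpider):                              # fix 1: subclass CrawlSpider, not Spider
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com"]
    rules = (
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse_article', follow=True),       # fix 2: named callback, not parse
    )

    def parse_article(self, response):
        # abridged version of the original extraction logic, run once per article page
        article = NewsArticle()
        article['url'] = response.url
        article['title'] = response.xpath(
                '//h1[@id="story-heading"]/text()').extract()
        yield article

Because the callback is no longer named parse, CrawlSpider's own parse method stays intact and keeps applying the rules to each newly downloaded page.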
