Scrapy Spider not Following Links

Question

I'm writing a scrapy spider to crawl for today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell http://www.nytimes.com, it successfully extracts a list of article urls with le.extract_links(response), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything but the homepage. I'm sort of at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.
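For reference, the shell check described above looks roughly like this (the variable name le is taken from the question; the allow pattern is an assumption mirroring the rule in the spider below):

$ scrapy shell http://www.nytimes.com
>>> from datetime import date
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>> today = date.today().strftime('%Y/%m/%d')
>>> le = LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, ))
>>> links = le.extract_links(response)      # 'response' is provided by the shell
>>> [link.url for link in links][:5]        # a handful of today's article URLs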

from datetime import date                                                       

import scrapy                                                                   
from scrapy.contrib.spiders import Rule                                         
from scrapy.contrib.linkextractors import LinkExtractor                         


from ..items import NewsArticle                                                 

with open('urls/debug/nyt.txt') as debug_urls:                                  
    debug_urls = debug_urls.readlines()                                         

with open('urls/release/nyt.txt') as release_urls:                              
    release_urls = release_urls.readlines() # ["http://www.nytimes.com"]                                 

today = date.today().strftime('%Y/%m/%d')                                       
print today                                                                     


class NytSpider(scrapy.Spider):                                                 
    name = "nyt"                                                                
    allowed_domains = ["nytimes.com"]                                           
    start_urls = release_urls                                                      
    rules = (                                                                      
            Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),          
                 callback='parse', follow=True),                                   
    )                                                                              

    def parse(self, response):                                                     
        article = NewsArticle()                                                                         
        for story in response.xpath('//article[@id="story"]'):                     
            article['url'] = response.url                                          
            article['title'] = story.xpath(                                        
                    '//h1[@id="story-heading"]/text()').extract()                  
            article['author'] = story.xpath(                                       
                    '//span[@class="byline-author"]/@data-byline-name'             
            ).extract()                                                         
            article['published'] = story.xpath(                                 
                    '//time[@class="dateline"]/@datetime').extract()            
            article['content'] = story.xpath(                                   
                    '//div[@id="story-body"]/p//text()').extract()              
            yield article  

Answer

I have found the solution to my problem. I was doing 2 things wrong:

  1. I needed to subclass CrawlSpider rather than Spider if I wanted it to automatically crawl sublinks.
  2. When using CrawlSpider, I needed to use a callback function rather than overriding parse. As per the docs, overriding parse breaks CrawlSpider functionality (both fixes are applied in the sketch below).
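
A minimal sketch of the spider with both fixes applied (the callback name parse_article and the hard-coded start URL are assumptions; the link-extractor rule and XPaths are taken from the question's code):

from datetime import date

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle

today = date.today().strftime('%Y/%m/%d')


class NytSpider(CrawlSpider):                              # fix 1: subclass CrawlSpider, not Spider
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com"]
    rules = (
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse_article', follow=True),       # fix 2: named callback, not parse
    )

    def parse_article(self, response):
        # abridged version of the original extraction logic, run once per article page
        article = NewsArticle()
        article['url'] = response.url
        article['title'] = response.xpath(
                '//h1[@id="story-heading"]/text()').extract()
        yield article

Because the callback is no longer named parse, CrawlSpider's own parse method stays intact and keeps applying the rules to each newly downloaded page.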
