Scrapy Spider not Following Links
Problem description
I'm writing a Scrapy spider to crawl today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell http://www.nytimes.com, it successfully extracts a list of article URLs with le.extract_links(response), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything but the homepage. I'm at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.
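For reference, the shell session described above would look roughly like this (a sketch; the allow pattern is assumed to mirror the rule in the spider below, and response is provided by the shell):

$ scrapy shell http://www.nytimes.com
>>> from datetime import date
>>> from scrapy.contrib.linkextractors import LinkExtractor
>>> today = date.today().strftime('%Y/%m/%d')
>>> le = LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, ))
>>> le.extract_links(response)  # list of Link objects for today's article URLs

The spider itself: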
from datetime import date

import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle

with open('urls/debug/nyt.txt') as debug_urls:
    debug_urls = debug_urls.readlines()
with open('urls/release/nyt.txt') as release_urls:
    release_urls = release_urls.readlines()  # ["http://www.nytimes.com"]

today = date.today().strftime('%Y/%m/%d')
print today


class NytSpider(scrapy.Spider):
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = release_urls

    rules = (
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse', follow=True),
    )

    def parse(self, response):
        article = NewsArticle()

        for story in response.xpath('//article[@id="story"]'):
            article['url'] = response.url
            article['title'] = story.xpath(
                '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                '//span[@class="byline-author"]/@data-byline-name'
            ).extract()
            article['published'] = story.xpath(
                '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                '//div[@id="story-body"]/p//text()').extract()
            yield article
Recommended answer
I have found the solution to my problem. I was doing two things wrong:
- I needed to subclass CrawlSpider rather than Spider if I wanted it to automatically crawl sublinks.
- When using CrawlSpider, I needed to use a callback function rather than overriding parse. As per the docs, overriding parse breaks CrawlSpider functionality. A corrected version is sketched below.
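Putting both fixes together, a minimal sketch of the corrected spider (the callback name parse_article is an assumption; any name other than parse works, and the start URL is inlined here instead of being read from urls/release/nyt.txt):

from datetime import date

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

from ..items import NewsArticle

today = date.today().strftime('%Y/%m/%d')


class NytSpider(CrawlSpider):
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = ["http://www.nytimes.com"]

    rules = (
        # follow=True tells CrawlSpider to keep extracting links from
        # pages matched by this rule as well.
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse_article', follow=True),
    )

    def parse_article(self, response):
        # A callback with a non-reserved name; overriding parse would
        # disable CrawlSpider's rule handling.
        for story in response.xpath('//article[@id="story"]'):
            article = NewsArticle()
            article['url'] = response.url
            article['title'] = story.xpath(
                '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                '//span[@class="byline-author"]/@data-byline-name'
            ).extract()
            article['published'] = story.xpath(
                '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                '//div[@id="story-body"]/p//text()').extract()
            yield article

With this change, scrapy crawl nyt -o out.json should follow the extracted article links and emit one item per matched story page.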