Scrapy - How to scrape daily for new pages

Question

I'm evaluating whether Scrapy is right for me. All I want is to scrape several sports news sites daily for the latest headlines and extract the title, date and article body. I don't care about following links within the body of the article; I just want the body.

As I understand it, crawling is a one-off job that crawls the entire site based on the links it finds. I don't want to hammer the site, and I also don't want to crawl the entire site; just the sports section, and only the headlines.

So in summary, I want Scrapy to (see the rough sketch after this list):

  1. once a day, find news articles from a specified domain that are different from yesterday's
  2. extract each new article's date, time and body
  3. save the results to a database
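
For context, here is a minimal sketch of the kind of spider this would need. The domain, section URL, CSS selectors and field names are hypothetical and only illustrate the standard Scrapy `Spider`/`Item` flow; step 3 (saving to a database) would be handled by an item pipeline receiving these items.

```python
import scrapy


class ArticleItem(scrapy.Item):
    # The fields the question asks for; an item pipeline can write these to a database.
    title = scrapy.Field()
    date = scrapy.Field()
    body = scrapy.Field()
    url = scrapy.Field()


class SportsNewsSpider(scrapy.Spider):
    name = "sports_news"
    # Hypothetical site; restrict the crawl to the sports section only.
    allowed_domains = ["example-sports-site.com"]
    start_urls = ["https://example-sports-site.com/sports/"]

    def parse(self, response):
        # Hypothetical selector for headline links on the section page.
        for href in response.css("a.headline::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Hypothetical selectors; adjust to each site's real markup.
        yield ArticleItem(
            title=response.css("h1::text").get(),
            date=response.css("time::attr(datetime)").get(),
            body=" ".join(response.css("div.article-body p::text").getall()),
            url=response.url,
        )
```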

Is it possible to do this, and if so, how would I achieve it? I've read the tutorial, but it seems the process it describes would crawl an entire site as a one-time job.

Answer

Take a look at the deltafetch middleware, which is part of a library of Scrapy add-ons published by scrapinghub. It stores on disk the URLs of pages that generate items and will not visit them again. It will still allow Scrapy to visit other pages (which is typically needed to find the item pages). This is a pretty simple example that can be customized for your specific needs.
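
As a sketch, assuming the standalone `scrapy-deltafetch` package (a later packaging of the same scrapinghub add-on), enabling it in `settings.py` looks roughly like this:

```python
# settings.py (sketch; assumes `pip install scrapy-deltafetch`)
SPIDER_MIDDLEWARES = {
    "scrapy_deltafetch.DeltaFetch": 100,
}
DELTAFETCH_ENABLED = True
# The middleware keeps its "already seen" state in a small on-disk store
# (by default under the project's .scrapy data directory), so it persists
# between daily runs.
```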

You would need to run your crawl daily (say, using cron) with this middleware enabled.
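
One way to wire that up, as a sketch: a small script that cron invokes once a day. The crontab line and module path below are hypothetical; the Scrapy APIs (`CrawlerProcess`, `get_project_settings`) are the standard way to run a spider from a script.

```python
# run_daily_crawl.py - hypothetical entry point for cron, e.g. a crontab line like:
#   0 6 * * * cd /path/to/project && /usr/bin/python run_daily_crawl.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Hypothetical import path for the spider sketched in the question.
from myproject.spiders.sports_news import SportsNewsSpider

process = CrawlerProcess(get_project_settings())  # picks up settings.py, incl. deltafetch
process.crawl(SportsNewsSpider)
process.start()  # blocks until the crawl finishes
```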
