Scraping a non-RSS page to generate a feed


Question

I want to scrape a page that updates regularly (new articles are added with exactly the same structure as the previous ones) in order to generate an RSS feed.

I can easily write the code to analyse the page, but how do I emulate a ping? That is, when the page updates, how can my PHP script know? Does it have to be a cron job?

(I know this is probably a duplicate question, but I searched for a direct answer with no luck. The closest I got was "Scrape and generate RSS feed", which has a scraping script but no information on how to make it respond to changes on the page automatically.)

Recommended answer

Depending on the system, it may or may not be easy to tell when the page was last updated.

To check for changes, you can look at the Last-Modified header in the page's HTTP response. Not all systems update this header properly, so it may not be useful. It's also possible that an unmodified page will return a status of 304 (Not Modified), particularly if you send an If-Modified-Since header in your request.
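The conditional request described above could look roughly like the following. This is a minimal sketch in Python rather than the asker's PHP, using only the standard library; the function names and the User-Agent string are made up for illustration, and the same request headers can be sent from any HTTP client.

```python
import urllib.error
import urllib.request


def conditional_headers(last_modified=None):
    """Build request headers; add If-Modified-Since when a previous value is known."""
    # The User-Agent value is a placeholder chosen for this sketch.
    headers = {"User-Agent": "feed-scraper-sketch/0.1"}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, last_modified=None, timeout=10):
    """Return (body, last_modified); body is None when the server answers 304."""
    request = urllib.request.Request(url, headers=conditional_headers(last_modified))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
            # Keep the old value if the server sends no Last-Modified header.
            return body, response.headers.get("Last-Modified", last_modified)
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: nothing new to scrape
            return None, last_modified
        raise
```

The caller would store the returned Last-Modified value between runs and pass it back in on the next request. If the server ignores the header, every call simply returns the full body.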

I would definitely run something like this as a cron job. While it might be possible to do it just from the headers on each feed request, whenever the page has to be refreshed your user will be waiting a long time (in relative terms) for your server to go out, get the page, do the processing, and send the response. I would be surprised if you didn't run into timeouts from time to time with a non-cron-based approach.
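Because the Last-Modified header is not always reliable, a cron-driven script can also compare a hash of the page body with the one saved on the previous run. The sketch below is one way to do that in Python; the state file, the function name, and the crontab line in the comment are assumptions for illustration, not part of the original answer.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical crontab entry that runs the updater every 15 minutes:
#   */15 * * * * /usr/bin/python3 /path/to/update_feed.py


def has_changed(body: bytes, state_file: Path) -> bool:
    """Compare the page's SHA-256 digest with the one saved by the previous run."""
    digest = hashlib.sha256(body).hexdigest()
    previous = None
    if state_file.exists():
        previous = json.loads(state_file.read_text()).get("sha256")
    if digest == previous:
        return False
    # Record the new digest so the next run can detect further changes.
    state_file.write_text(json.dumps({"sha256": digest}))
    return True
```

When `has_changed` returns `True`, the script would re-parse the page and rewrite a static feed file, so feed readers are served the cached file and never wait on the scrape. Hashing the whole body reports a change whenever any byte differs, so hashing only the article list may be preferable if the page contains timestamps or rotating content.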
