How to detect a changed webpage?

Question

In my application, I fetch webpages periodically using LWP. Is there any way to check whether the webpage has changed between two consecutive fetches, other than explicitly comparing the two copies? Is there a signature (say, a CRC) generated at the lower protocol layers that can be extracted and compared against older signatures to detect possible changes?

Solution

There are two possible approaches. One is to use a digest of the page, e.g.

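use strict;
use warnings;

use Digest::MD5 'md5_hex';
use Encode 'encode_utf8';
use LWP::UserAgent;

# Placeholder URL and stored digest; substitute your own.
my $url          = 'http://www.example.com/';
my $saved_digest = '';    # digest saved from the previous fetch

# fetch the page
my $response = LWP::UserAgent->new->get($url);
die $response->status_line unless $response->is_success;

# md5_hex needs bytes, so encode the decoded (character) content first.
my $digest = md5_hex( encode_utf8( $response->decoded_content ) );

if ( $digest ne $saved_digest ) {
    # the page has changed.
}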

Another option is to use an HTTP ETag, if the server provides one for the requested resource. Store it, then include it in an If-None-Match header on subsequent requests. If the server's ETag is unchanged, you'll get a 304 Not Modified status and an empty response body. Otherwise you'll get the new page (and a new ETag). See the section on entity tags in RFC 2616 (since obsoleted; the current definition is in RFC 9110).
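The ETag approach could look roughly like the sketch below. The helper names build_request, check_response and fetch_if_changed are made up for illustration, and how you persist the ETag between runs is left to you. It assumes the server actually sends an ETag header; if it doesn't, every fetch is simply reported as changed.

```perl
use strict;
use warnings;

use HTTP::Request;
use LWP::UserAgent;

# Build a GET request that is conditional on the ETag saved from the
# previous fetch, if there is one. (Hypothetical helper name.)
sub build_request {
    my ( $url, $saved_etag ) = @_;
    my $request = HTTP::Request->new( GET => $url );
    $request->header( 'If-None-Match' => $saved_etag ) if defined $saved_etag;
    return $request;
}

# Interpret the response. Returns ( $changed, $etag_to_store ).
# (Hypothetical helper name.)
sub check_response {
    my ( $response, $saved_etag ) = @_;

    # 304 Not Modified: the server says the resource is unchanged.
    return ( 0, $saved_etag ) if $response->code == 304;

    die $response->status_line unless $response->is_success;

    # New content; the ETag is undef if the server did not send one.
    return ( 1, scalar $response->header('ETag') );
}

# Perform one conditional fetch over the network.
sub fetch_if_changed {
    my ( $ua, $url, $saved_etag ) = @_;
    my $response = $ua->request( build_request( $url, $saved_etag ) );
    return check_response( $response, $saved_etag );
}
```

A periodic job would call fetch_if_changed( LWP::UserAgent->new, $url, $saved_etag ) and store the returned ETag for the next run. Splitting request building from response handling keeps the logic checkable without network access.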

Of course, the server could be lying and sending the same ETag even though the content has changed. There's no way to know unless you fetch the page and look.
