How to check if the content of a webpage has changed?

Question

Basically, I'm trying to run some code (Python 2.7) if the content on a website changes; otherwise, wait for a bit and check again later.

I'm thinking of comparing hashes. The problem with this is that if even a single byte or character of the page has changed, the hash will be different. So, for example, if the page displays the current date, the hash would be different every single time and tell me that the content has been updated.

So... how would you do this? Would you look at the size of the HTML in KB? Would you look at the string length and, if for example the length has changed by more than 5%, treat the content as "changed"? Or is there some kind of hashing algorithm where the hashes stay the same if only small parts of the string/content have changed?

About Last-Modified: unfortunately, not all servers return this date correctly, so I don't think it is a reliable solution. I think a better way is to combine the hash and content-length solutions: check the hash, and if it has changed, check the string length.
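
The hash-then-length idea above could be sketched roughly as follows. This is only an illustration written for Python 3; the names `fingerprint` and `has_changed` are made up here, and the 5% threshold is borrowed from the question rather than being a recommended value.

```python
import hashlib

def fingerprint(body):
    """Return (hex digest, length) for a downloaded page body (bytes)."""
    return hashlib.sha256(body).hexdigest(), len(body)

def has_changed(old_fp, new_fp, min_length_change=0.05):
    """Hypothetical check: the hash must differ AND the length must move enough."""
    old_hash, old_len = old_fp
    new_hash, new_len = new_fp
    if old_hash == new_hash:
        return False  # byte-for-byte identical
    if old_len == 0:
        return new_len > 0
    # Relative change in length compared with the stored version.
    return abs(new_len - old_len) / old_len > min_length_change
```

Note that this sketch reports "unchanged" for any edit that keeps the length roughly the same, so it trades false positives for false negatives.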

Recommended answer

If you're trying to make a tool that can be applied to arbitrary sites, you could still start by getting it working for a few specific ones: download them repeatedly, identify the exact differences you'd like to ignore, and try to handle those reasonably generically without ignoring meaningful differences. Such quick hands-on sampling should give you much more concrete ideas about the challenge you face. Whatever solution you attempt, test it against an increasing number of sites and tweak it as you go.

"Would you look at the size of the HTML in KB? Would you look at the string length and, if for example the length has changed by more than 5%, treat the content as 'changed'?"

That's incredibly rough, and I'd avoid it if at all possible. But you do need to weigh up the cost of mistakenly deeming a page unchanged against the cost of mistakenly deeming it changed.

"Or is there some kind of hashing algorithm where the hashes stay the same if only small parts of the string/content have changed?"

You can make such a "hash", but it's very hard to tune its sensitivity to meaningful change in the document. Anyway, as an example: you could sort the 256 possible byte values by their frequency in the document and treat that ordering as a 2k hash; you can later do a "diff" to see how much that byte-value ordering has changed in a later download. (To save memory, you might get away with using just the printable ASCII values, or even just letters after standardising capitalisation.)
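
A minimal sketch of that byte-frequency ordering, assuming Python 3 and a page body held as `bytes`. The function names and the particular distance measure (total displacement of each byte value between the two orderings) are illustrative choices, not something the answer prescribes.

```python
from collections import Counter

def byte_rank_signature(data):
    """Return all 256 byte values ordered from most to least frequent."""
    counts = Counter(data)  # iterating over bytes yields ints 0..255
    # Ties are broken by byte value so the ordering is deterministic.
    return sorted(range(256), key=lambda b: (-counts[b], b))

def rank_distance(sig_a, sig_b):
    """Sum of how far each byte value moved between the two orderings."""
    pos_b = {b: i for i, b in enumerate(sig_b)}
    return sum(abs(i - pos_b[b]) for i, b in enumerate(sig_a))
```

A distance of 0 means the ordering is identical; what threshold counts as "changed" would have to be tuned against real pages.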

An alternative is to generate a set of hashes for different slices of the document: e.g. divide it into header vs. body, then divide the body by heading levels and then by paragraphs, until you've reached at least the desired level of granularity (e.g. 30 slices). You can then say that if only 2 of the 30 slices have changed, you'll consider the document the same.
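
As a rough illustration of per-slice hashing, the sketch below (Python 3) simply splits text on blank lines instead of parsing HTML into header, headings and paragraphs; that simplification, the function names, and the default of 2 tolerated changes are assumptions made for the example.

```python
import hashlib
from collections import Counter

def slice_hashes(text, sep="\n\n"):
    """Hash each non-empty slice; a Counter keeps duplicate slices distinct."""
    slices = [s.strip() for s in text.split(sep) if s.strip()]
    return Counter(hashlib.sha256(s.encode("utf-8")).hexdigest() for s in slices)

def changed_slices(old, new):
    """Number of slices that were removed, added, or replaced."""
    removed = sum((old - new).values())
    added = sum((new - old).values())
    return max(removed, added)

def is_same(old_text, new_text, max_changed=2):
    """Treat the document as unchanged if at most max_changed slices differ."""
    return changed_slices(slice_hashes(old_text), slice_hashes(new_text)) <= max_changed
```

Because the hashes are compared as a multiset rather than by position, inserting one paragraph does not make every later slice count as changed.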

You might also try replacing certain types of content before hashing, e.g. using regular expression matching to replace times with "<time>".
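
For instance, a sketch along these lines (Python 3) replaces clock times and ISO-style dates with placeholders before hashing. The two patterns are illustrative assumptions and would need extending for whatever date and time formats the target pages actually use.

```python
import hashlib
import re

# Illustrative patterns only: HH:MM or HH:MM:SS, and YYYY-MM-DD.
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def normalized_hash(html):
    """Hash the page text after masking volatile dates and times."""
    text = DATE_RE.sub("<date>", html)
    text = TIME_RE.sub("<time>", text)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Two downloads that differ only in a masked timestamp then produce the same hash, while any other edit still changes it.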

You could also lower your tolerance for change as the time since you last processed the page increases, which could lessen or cap the "cost" of mistakenly deeming it unchanged.
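
One possible way to express a tolerance that shrinks over time is an exponential decay, sketched below in Python 3. The decay shape, the 5% starting tolerance, the 24-hour half-life, and the function names are all assumptions chosen for illustration.

```python
def allowed_change(hours_since_processed, base=0.05, half_life_hours=24.0):
    """Tolerated fraction of change; halves every half_life_hours."""
    return base * 0.5 ** (hours_since_processed / half_life_hours)

def should_reprocess(change_fraction, hours_since_processed):
    """Reprocess once the measured change exceeds the current tolerance."""
    return change_fraction > allowed_change(hours_since_processed)
```

Here `change_fraction` stands for whatever change measure you settle on (for example, the share of slices whose hash differs); the longer a page goes unprocessed, the smaller the change needed to trigger processing.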
