Is there a hashing algorithm that is tolerant of minor differences?

Question

I'm doing some web crawling where I look for certain terms in web pages, record where they appear on the page, and cache the result for later use. I'd like to be able to check each page periodically for any major changes. Something like MD5 can be foiled by the page simply printing the current date and time.

Are there any hashing algorithms that work for something like this?

Recommended answer

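A common way to do document similarity is shingling, which is somewhat more involved than hashing. Shingling breaks each document into overlapping runs of k consecutive words (shingles) and compares the resulting sets, so a small edit only disturbs the few shingles that overlap it. Also look into content-defined chunking as a way to split up the document.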
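To make the shingling idea concrete, here is a minimal sketch in Python. It assumes word-level shingles of length k=3 and Jaccard similarity as the comparison. The function names (`shingles`, `jaccard`), the sample pages, and the choice of k are mine, purely for illustration. A real crawler would probably strip HTML first and pick a change threshold by experiment.

```python
import re


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles in text (k is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text too short for a full shingle: treat it as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical page contents: same body, different timestamp.
BODY = ("the quick brown fox jumps over the lazy dog "
        "near the quiet river bank every single morning")
page_v1 = BODY + " generated at 10:00"
page_v2 = BODY + " generated at 11:30"
page_v3 = "completely different content about stock prices and weather forecasts"

same_body = jaccard(shingles(page_v1), shingles(page_v2))
unrelated = jaccard(shingles(page_v1), shingles(page_v3))
print(f"timestamp changed: {same_body:.2f}, unrelated page: {unrelated:.2f}")
```

The timestamp change only touches the shingles that overlap it, so the score stays high, while an unrelated page shares no shingles at all. For large crawls people usually store a compact sketch of the shingle set (MinHash, for example) instead of the full set.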

I read a paper a few years back about using Bloom filters for similarity detection: "Using Bloom Filters to Refine Web Search Results". It's an interesting idea, but I never got around to experimenting with it.
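As a rough illustration of the general idea, and not necessarily the method the paper describes, one could insert a page's word shingles into a small Bloom filter and compare two pages by how many set bits their filters share. Everything here is an assumption made for the sketch: the filter size (1024 bits), the number of hash functions (3), the shingle length, and all of the names.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos


def page_filter(text, k=3):
    """Build a Bloom filter from the k-word shingles of text."""
    words = text.lower().split()
    bf = BloomFilter()
    for i in range(max(len(words) - k + 1, 1)):
        bf.add(" ".join(words[i:i + k]))
    return bf


def bit_similarity(a, b):
    """Shared set bits divided by total set bits across both filters."""
    union = bin(a.bits | b.bits).count("1")
    if union == 0:
        return 1.0
    return bin(a.bits & b.bits).count("1") / union


# Hypothetical page contents: same body, different timestamp.
BODY = ("the quick brown fox jumps over the lazy dog "
        "near the quiet river bank every single morning")
old_page = page_filter(BODY + " generated at 10:00")
new_page = page_filter(BODY + " generated at 11:30")
other_page = page_filter("completely different content about stock prices and weather")

print(bit_similarity(old_page, new_page), bit_similarity(old_page, other_page))
```

The appeal is that each cached page needs only a fixed-size bit vector, and comparing two pages is a couple of bitwise operations. The cost is that hash collisions make unrelated pages look slightly similar, so the filter size has to be tuned to the page length.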
