Algorithm to find duplicates


Problem description


Are there any famous algorithms to efficiently find duplicates?


For example, suppose I have thousands of photos, each with a unique name, and duplicates may exist in different sub-folders. Is using std::map or any other hash map a good idea?

Recommended answer


If you're dealing with files, one idea is to first check the files' lengths, and then generate a hash only for the files that have the same size.


Then just compare the files' hashes. If they're the same, you've got a duplicate file.
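As a rough illustration of the size-first idea, below is a minimal C++17 sketch (the group_by_size helper and the choice of std::unordered_map are my own, not part of the original answer) that walks a directory tree and buckets files by size, so that only files sharing a size ever need to be hashed:

#include <cstdint>
#include <filesystem>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

// Walk a directory tree and bucket regular files by size.
// Only buckets holding more than one file need any hashing later.
std::unordered_map<std::uintmax_t, std::vector<fs::path>>
group_by_size(const fs::path& root) {
    std::unordered_map<std::uintmax_t, std::vector<fs::path>> buckets;
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (entry.is_regular_file())
            buckets[entry.file_size()].push_back(entry.path());
    }
    return buckets;
}

Any bucket with a single entry can be discarded immediately; the size check alone already rules those files out.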


There's a tradeoff between speed and accuracy: it can happen, however unlikely, that different files produce the same hash. So you can improve the solution: generate a simple, fast hash to find candidate duplicates. When the hashes differ, the files are different. When they're equal, generate a second hash. If the second hash differs, you just had a false positive. If they're equal again, you probably have a real duplicate.

In other words:

Generate the file sizes.
For each file, check whether any other file has the same size.
If so, generate a fast hash for those files.
Compare the hashes.
If different, ignore.
If equal, generate a second hash.
Compare.
If different, ignore.
If equal, you have two identical files.


Doing a hash for every file will take too much time and will be useless if most of your files are different.
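Putting the steps together, here is a self-contained C++17 sketch of the whole pipeline. It is only an illustration of the idea in this answer, assuming a home-grown FNV-1a-style hash with two different seeds as stand-ins for a real fast hash and a real confirming hash (in practice you might use something like xxHash for the first pass and MD5/SHA-1 for the second, or simply compare the bytes directly):

#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <unordered_map>
#include <vector>

namespace fs = std::filesystem;

// FNV-1a-style hash over the file's bytes. Different seeds give two
// independent hash functions from the same routine (stand-ins only).
std::uint64_t hash_file(const fs::path& p, std::uint64_t seed) {
    std::uint64_t h = seed;
    std::ifstream in(p, std::ios::binary);
    char buf[8192];
    while (in.read(buf, sizeof(buf)), in.gcount() > 0) {
        for (std::streamsize i = 0; i < in.gcount(); ++i) {
            h ^= static_cast<unsigned char>(buf[i]);
            h *= 1099511628211ull;               // FNV-1a 64-bit prime
        }
    }
    return h;
}

int main(int argc, char* argv[]) {
    const fs::path root = argc > 1 ? argv[1] : ".";

    // Step 1: generate file sizes and group files by size.
    std::unordered_map<std::uintmax_t, std::vector<fs::path>> by_size;
    for (const auto& e : fs::recursive_directory_iterator(root))
        if (e.is_regular_file())
            by_size[e.file_size()].push_back(e.path());

    for (const auto& [size, files] : by_size) {
        if (files.size() < 2) continue;          // unique size: cannot be a duplicate

        // Step 2: fast hash, only within a same-size group.
        std::unordered_map<std::uint64_t, std::vector<fs::path>> by_fast_hash;
        for (const auto& f : files)
            by_fast_hash[hash_file(f, 14695981039346656037ull)].push_back(f);

        for (const auto& [h1, group] : by_fast_hash) {
            if (group.size() < 2) continue;      // fast hashes differ: different files

            // Step 3: second, independent hash to weed out false positives.
            std::unordered_map<std::uint64_t, std::vector<fs::path>> confirmed;
            for (const auto& f : group)
                confirmed[hash_file(f, 0x9e3779b97f4a7c15ull)].push_back(f);

            for (const auto& [h2, dups] : confirmed) {
                if (dups.size() < 2) continue;   // false positive from the fast hash
                std::cout << "Likely duplicates:\n";
                for (const auto& f : dups)
                    std::cout << "  " << f << '\n';
            }
        }
    }
}

Note that even two matching hashes are not an absolute guarantee; if you need certainty, finish with a byte-for-byte comparison of the remaining candidates.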
