Detecting duplicate files


Question

I'd like to detect duplicate files in a directory tree. When two or more identical files are found, only one copy will be kept and the remaining duplicates will be deleted to save disk space.

Duplicates are files with the same content; their file names and paths may differ.

I was thinking about using a hash algorithm for this purpose, but there is a chance that different files have the same hash. I therefore need an additional mechanism that tells me when two files are not the same even though their hashes match, because I don't want to delete a file that is not really a duplicate.

Which additional fast and reliable mechanism would you use?

Recommended answer

Calculating a hash for every file will make your program slow, because each file has to be read in full. It's better to check the file size first. All duplicate files must have the same size, so apply the hash check only to files that share a size. This will make your program run faster.
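The size-first idea can be sketched as follows. This is a minimal illustration, not the answerer's code: the function names `file_hash` and `find_duplicate_candidates`, the choice of SHA-256, and the 64 KiB chunk size are all assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def file_hash(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_candidates(root):
    """Group regular files under root by size, then by hash within each size."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate, so no hashing
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_hash(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Each returned group lists paths whose size and hash both match. Files with a unique size are never opened, which is where the time is saved.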

There can be more steps:


  1. Check whether the file sizes are equal.
  2. If step 1 passes, check whether the first and last ranges of bytes (say 100 bytes each) are equal.
  3. If step 2 passes, check the file type.
  4. If step 3 passes, compare the hashes as the final check.
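The steps above could look roughly like this for a pair of files. It is a sketch under assumptions: the names `probably_same`, `_read_ends` and `_sha256` are made up for the example, SHA-256 is an arbitrary choice of hash, and step 3 (file type) is left out because the answer does not say how the type would be determined.

```python
import hashlib
import os


def _read_ends(path, size, n):
    """Return the first and last n bytes of a file of the given size."""
    with open(path, "rb") as f:
        head = f.read(n)
        f.seek(max(size - n, 0))
        tail = f.read(n)
    return head, tail


def _sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def probably_same(path_a, path_b, probe=100):
    """Run the cheap checks first and hash only if they all pass."""
    # Step 1: sizes must match.
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False
    # Step 2: the first and last `probe` bytes must match.
    if _read_ends(path_a, size, probe) != _read_ends(path_b, size, probe):
        return False
    # Step 3 (file type) is omitted here; see the note above the code.
    # Step 4: full-content hash as the last resort.
    return _sha256(path_a) == _sha256(path_b)
```

The function returns as soon as any check fails, so the two full reads needed for hashing happen only for pairs that already agree on size and on both ends.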

The more cheap criteria you add, the faster the program will run, because most non-duplicates are rejected early and the last resort (the hash) can often be avoided.
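The answer does not say what to do when two different files still end up with the same hash, which is the case the question worries about. One option, offered here as an assumption rather than as the answerer's method, is a final byte-by-byte comparison using the standard library's `filecmp.cmp` with `shallow=False`. The function name `confirmed_duplicates` is hypothetical.

```python
import filecmp


def confirmed_duplicates(candidates):
    """Split a list of same-hash paths into groups verified byte for byte."""
    groups = []
    for path in candidates:
        for group in groups:
            # shallow=False forces a comparison of the actual file contents.
            if filecmp.cmp(group[0], path, shallow=False):
                group.append(path)
                break
        else:
            # No existing group matched, so this file starts a new one.
            groups.append([path])
    return [g for g in groups if len(g) > 1]
```

Only files in the same returned group are safe to treat as copies of each other. Since this runs only on files that already passed every earlier check, the extra reads are limited to likely duplicates.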
