Fastest algorithm to detect duplicate files


Question

In the process of finding duplicates in my 2 terabytes of HDD-stored images I was astonished by the long run times of the tools fslint and fslint-gui.
So I analyzed the internals of the core tool findup, which is implemented as a very well written and documented shell script using an ultra-long pipe. Essentially it is based on find and hashing (md5 and SHA1). The author states that it is faster than any other alternative, which I couldn't believe. So I found Detecting duplicate files, where the topic quite quickly slid towards hashing and comparing hashes, which in my opinion is not the best and fastest way.

So the usual algorithm seems to work like this (a minimal sketch in code follows the list):

  • generate a sorted list of all files (path, size, ID)
  • group files with exactly the same size
  • calculate the hash of all files with the same size and compare the hashes
  • same hash means identical files - a duplicate is found
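
Expressed as code, that scheme might look roughly like the following Python sketch (the helper structure and the 1 MiB read size are my choices for illustration, not what findup actually does - per the question, findup implements roughly this with an ultra-long shell pipe around find, md5sum and sha1sum):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    """Group files under `root` by size, then by SHA-1 of their full contents."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                continue  # unreadable entry, skip it

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            h = hashlib.sha1()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
        duplicates.extend(group for group in by_hash.values() if len(group) > 1)
    return duplicates
```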

Sometimes the speed gets increased by first using a faster hash algorithm with a higher collision probability (like md5), and then, if the hashes are the same, a second, slower but less collision-prone algorithm to prove the duplicates. Another improvement is to first hash only a small chunk in order to sort out totally different files.
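
As a sketch, those two refinements could be layered on top of the same-size groups like this (the 4 KiB probe size, md5 as the quick filter and SHA-1 as the stronger pass are assumptions for illustration):

```python
import hashlib
from collections import defaultdict

PROBE_SIZE = 4096  # assumed size of the cheap first-chunk probe

def probe_hash(path):
    """Cheap filter: md5 of only the first PROBE_SIZE bytes."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(PROBE_SIZE)).digest()

def full_hash(path):
    """Stronger pass: SHA-1 over the whole file."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

def refine(same_size_paths):
    """Split a same-size group first by probe hash, then by full hash."""
    by_probe = defaultdict(list)
    for path in same_size_paths:
        by_probe[probe_hash(path)].append(path)

    confirmed = []
    for candidates in by_probe.values():
        if len(candidates) < 2:
            continue  # the cheap probe already proved them different
        by_full = defaultdict(list)
        for path in candidates:
            by_full[full_hash(path)].append(path)
        confirmed.extend(g for g in by_full.values() if len(g) > 1)
    return confirmed
```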

So in my opinion this scheme is broken in two different dimensions:

  • the duplicate candidates get read from the slow HDD again (first chunk), again (full md5) and once more (sha1)
  • by using hashing instead of comparing the files byte by byte we introduce a (low) probability of false positives
  • a hash calculation is a lot slower than a byte-by-byte compare

I found one (Windows) app which claims to be fast by not using this common hashing scheme.

Am I totally wrong with my ideas and opinion?

[Update]

There seems to be some opinion that hashing might be faster than comparing. But that seems to be a misconception carried over from the general "hash tables speed things up" use case. To generate a hash of a file for the first time, the file needs to be read fully, byte by byte. So on the one hand there is a byte-by-byte compare, which only compares as many bytes of every duplicate candidate as it takes to reach the first differing position. And on the other hand there is a hash function, which generates an ID out of so and so many bytes - let's say the first 10k bytes of a terabyte, or the full terabyte if the first 10k are the same. So under the assumption that I don't usually have a ready-calculated and automatically updated table of all file hashes, I need to calculate the hash and read every byte of the duplicate candidates. A byte-by-byte compare doesn't need to do this.
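
For contrast, a byte-by-byte (in practice block-by-block) compare with an early exit looks like this - roughly what cmp does for two files:

```python
def files_equal(path_a, path_b, block_size=1 << 20):
    """Byte-by-byte (block-by-block) equality check with an early exit:
    stops reading at the first differing block."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if block_a != block_b:
                return False   # early exit, the rest is never read
            if not block_a:    # both files exhausted at the same point
                return True
```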

[Update 2]

I've got a first answer which again goes in the direction "hashes are generally a good idea", and out of that (not so wrong) thinking tries to rationalize the use of hashes with (IMHO) wrong arguments. "Hashes are better or faster because you can reuse them later" was not the question. "Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total." is skewed in favor of hashes and (IMHO) wrong too. Why can't I just read a block from each same-size file and compare them in memory? If I have to compare 100 files, I open 100 file handles, read a block from each in parallel and then do the comparison in memory. This seems to be a lot faster than updating one or more complicated slow hash algorithms with these 100 files.
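
A sketch of that idea: open all same-size candidates, read one block from each per round, partition the group by block contents, and stop reading a file as soon as its content proves unique (file-handle limits and seek costs are deliberately ignored here):

```python
from collections import defaultdict

def split_group_by_content(paths, block_size=1 << 20):
    """Partition a group of same-size files into sets with identical
    contents, reading one block per file per round and dropping a file
    as soon as its content proves unique."""
    handles = {path: open(path, "rb") for path in paths}
    try:
        pending = [list(paths)]   # groups that still look identical so far
        identical = []            # fully verified duplicate groups
        while pending:
            group = pending.pop()
            # Bucket by the next block's raw contents.  A real tool might
            # bucket by a digest of the block to keep the keys small.
            buckets = defaultdict(list)
            reached_eof = False
            for path in group:
                block = handles[path].read(block_size)
                buckets[block].append(path)
                reached_eof = reached_eof or not block
            for bucket in buckets.values():
                if len(bucket) < 2:
                    continue                  # unique content, stop reading it
                if reached_eof:
                    identical.append(bucket)  # equal all the way to the end
                else:
                    pending.append(bucket)    # still candidates, read on
        return identical
    finally:
        for f in handles.values():
            f.close()
```

Whether this beats hashing depends a lot on the medium: on a spinning HDD, reading many files in parallel block by block can degenerate into seek thrashing, which is part of why the sequential hash-one-file-at-a-time approach is popular.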

[Update 3]

Given the very big bias in favor of "one should always use hash functions because they are very good!" I read through some SO questions on hash quality, e.g. this one: Which hashing algorithm is best for uniqueness and speed? It seems that common hash functions produce collisions more often than we think, thanks to bad design and the birthday paradox. The test set contained: "A list of 216,553 English words (in lowercase), the numbers "1" to "216553" (think ZIP codes, and how a poor hash took down msn.com) and 216,553 "random" (i.e. type 4 uuid) GUIDs". These tiny data sets produced from around 100 to nearly 20k collisions. So testing millions of files for (in)equality based only on hashes might not be a good idea at all.
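
The birthday-paradox arithmetic behind that observation can be checked directly. A small estimate, assuming a uniformly distributed hash of a given width (32 bits matches the kind of general-purpose hashes tested in that post; 160 bits corresponds to SHA-1):

```python
def expected_collisions(num_items, hash_bits):
    """Birthday-bound estimate of the expected number of colliding pairs
    for a uniformly distributed hash of the given width."""
    pairs = num_items * (num_items - 1) / 2
    return pairs / 2 ** hash_bits

m = 216_553  # size of each test set quoted above
print(expected_collisions(m, 32))   # ~5 colliding pairs even for a perfect 32-bit hash
print(expected_collisions(m, 160))  # ~1.6e-38 for a 160-bit hash such as SHA-1
```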

I guess I need to modify the findup script and replace the md5/sha1 part of the pipe with "cmp" and just measure the times. I'll keep you updated.

[Update 4]

Thanks for all the feedback. Slowly we are converging. Background is what I observed when fslint's findup was running on my machine, md5summing hundreds of images: that took quite a while and the HDD was spinning like hell. So I was wondering what the heck this crazy tool was thinking, destroying my HDD and taking huge amounts of time, when a byte-by-byte compare 1) is less expensive per byte than any hash or checksum algorithm and 2) can return early on the first difference, so I save tons of time and don't waste HDD bandwidth reading full files and calculating hashes over full files. I still think that's true - but I guess I missed the point that, although a 1:1 comparison (if (file_a[i] != file_b[i]) return 1;) might be cheaper per byte than hashing, complexity-wise hashing with its O(n) behaviour may win when more and more files need to be compared against each other.

I have put this problem on my list and plan to either replace the md5 part of fslint's findup with cmp, or to enhance Python's filecmp.py library, which only compares 2 files at once, with a multiple-files option and maybe an md5hash version. So thank you all for the moment.

And generally the situation is like you guys say: the best way (TM) totally depends on the circumstances: HDD vs SSD, likelihood of same-length files, duplicate files, typical file size, performance of CPU vs. memory vs. disk, single vs. multi-core, and so on. And I learned that I should consider using hashes more often - but I'm an embedded developer and most of the time I have very, very limited resources ;-)

Thanks for all your effort! Marcel

Answer

The fastest de-duplication algorithm will depend on several factors:

  1. how frequent is it to find near-duplicates? If it is extremely frequent to find hundreds of files with the exact same contents and a one-byte difference, this will make strong hashing much more attractive. If it is extremely rare to find more than a pair of files that are of the same size but have different contents, hashing may be unnecessary.
  2. how fast is it to read from disk, and how large are the files? If reading from the disk is very slow or the files are very small, then one-pass hashes, however cryptographically strong, will be faster than making small passes with a weak hash and then a stronger pass only if the weak hash matches.
  3. how many times are you going to run the tool? If you are going to run it many times (for example to keep things de-duplicated on an ongoing basis), then building an index with the path, size & strong_hash of each and every file may be worth it, because you would not need to rebuild it on subsequent runs of the tool.
  4. do you want to detect duplicate folders? If you want to do so, you can build a Merkle tree (essentially a recursive hash of the folder's contents + its metadata) and add those hashes to the index too (see the sketch after this list).
  5. what do you do with file permissions, modification date, ACLs and other file metadata that excludes the actual contents? This is not related directly to algorithm speed, but it adds extra complications when choosing how to deal with duplicates.
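
For point 4, a rough sketch of such a Merkle-style folder hash (exactly which metadata to mix in is a design choice; this version uses only child names and content hashes):

```python
import hashlib
import os

def file_hash(path):
    """SHA-1 over the file contents."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

def tree_hash(directory, index):
    """Merkle-style hash of a directory: hash the sorted child names
    together with each child's hash, so identical subtrees end up with
    identical hashes.  `index` collects path -> digest for every node."""
    h = hashlib.sha1()
    for name in sorted(os.listdir(directory)):
        child = os.path.join(directory, name)
        if os.path.isdir(child):
            digest = tree_hash(child, index)
        else:
            digest = file_hash(child)
        h.update(name.encode("utf-8", "surrogateescape"))
        h.update(digest)
    index[directory] = h.digest()
    return index[directory]
```

Two folders whose trees hash to the same value can then be reported as duplicate folders with a single index lookup.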

Hence, there is no single way to answer the original question - when is each approach fastest?

Assuming that two files have the same size, there is, in general, no faster way to detect whether they are duplicates than comparing them byte-by-byte (even though technically you would compare them block-by-block, as the file system is more efficient when reading blocks than individual bytes).

Assuming that many (say n) files have the same size, to find which are duplicates you would need to make n * (n-1) / 2 comparisons to test them pair-wise against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total. Even if hashing takes k times as long as a byte-by-byte compare, hashing is better when k < (n-1)/2. Hashes may yield false positives (although strong hashes will only do so with astronomically low probability), but testing those byte-by-byte will only increase k by at most 1. With k=3 you break even at n=7 and are ahead for anything larger; with a more conservative k=2 the break-even point is already n=5. In practice, I would expect k to be very near 1: it will probably be more expensive to read from disk than to hash whatever you have read.
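
As a worked check of that cost model (taking one full byte-by-byte compare as the unit of work and one hash as costing k such units):

```python
def pairwise_cost(n):
    """All-pairs byte-by-byte comparison of n same-size files,
    measured in units of one full compare."""
    return n * (n - 1) // 2

def hashing_cost(n, k):
    """Hashing each of the n files once, if one hash costs k times
    as much as one full compare."""
    return n * k

# Hashing pulls ahead once n*k < n*(n-1)/2, i.e. once n > 2*k + 1.
for k in (1, 2, 3):
    for n in (3, 5, 7, 9):
        cheaper = "hashing" if hashing_cost(n, k) < pairwise_cost(n) else "comparing"
        print(f"k={k}, n={n}: compare={pairwise_cost(n)}, hash={hashing_cost(n, k)} -> {cheaper}")
```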

The probability that several files will have the same size increases with the square of the number of files (look up the birthday paradox). Therefore, hashing can be expected to be a very good idea in the general case. It is also a dramatic speedup in case you ever run the tool again, because it can reuse an existing index instead of building it anew. So comparing 1 new file to 1M existing, different, indexed files of the same size can be expected to take 1 hash + 1 lookup in the index, vs. 1M comparisons in the no-hashing, no-index scenario: an estimated 1M times faster!
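
A sketch of such an index, keyed by size and strong hash so that checking one new file costs one hash plus one lookup (the dbm file name and key layout are just an illustration):

```python
import dbm
import hashlib
import os

def strong_hash(path):
    """SHA-1 over the full file contents."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def check_against_index(path, index_file="dupes.index"):
    """Return a previously indexed file with the same size and hash, if any,
    and record `path` so future runs can reuse the work."""
    key = f"{os.path.getsize(path)}:{strong_hash(path)}"
    with dbm.open(index_file, "c") as index:
        match = index.get(key)
        if match is not None:
            return match.decode()  # duplicate of an already indexed file
        index[key] = path          # first time this content is seen
        return None
```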

Note that you can repeat the same argument with a multilevel hash: if you use a very fast hash of, say, the first, central and last 1k bytes, it will be much faster to hash than to compare the files (k < 1 above) - but you have to expect collisions, and make a second pass with a strong hash and/or a byte-by-byte comparison when they are found. This is a trade-off: you are betting that there will be differences that save you the time of a full hash or a full compare. I think it is worth it in general, but the "best" answer depends on the specifics of the machine and the workload.

[Update]

The OP seems to be under the impression that

  • hash calculation is slow
  • fast hashes produce collisions
  • any use of hashing requires reading the full file contents, and is therefore overkill for files that already differ in their first bytes.

I have added this section to counter these arguments:

  • A strong hash (SHA-1) takes about 5 cycles per byte to compute, or around 15 ns per byte on a modern CPU. Disk latencies for an SSD or a spinning HDD are on the order of 75k ns and 5M ns, respectively. You can hash 1k of data in the time it takes to start reading it from an SSD. A faster, non-cryptographic hash, meowhash, can hash 1 byte per cycle. Main-memory latencies are around 120 ns - there are easily 400 cycles to be had in the time it takes to fulfill a single non-cached memory access.
  • In 2018, the only known collision in SHA-1 comes from the shattered project, which took huge resources to compute. Other strong hashing algorithms are not much slower, and are stronger (SHA-3).
  • You can always hash parts of a file instead of all of it, and store the partial hashes until you run into a collision, at which point you calculate increasingly larger hashes until, in the case of a true duplicate, you have hashed the whole thing. This gives you much faster index building.

My point is not that hashing is the end-all, be-all. It is that, for this application, it is very useful and not the real bottleneck: the true bottleneck is actually traversing and reading parts of the file system, which is much, much slower than any hashing or comparing of its contents.

