What is the fastest way to check if files are identical?


Question


If you have 1,000,000 source files, you suspect they are all the same, and you want to compare them, what is the current fastest method to compare those files? Assume they are Java files and the platform where the comparison is done is not important. cksum is making me cry. When I say identical, I mean ALL identical.

Update: I know about generating checksums. diff is laughable ... I want speed.

Update: Don't get stuck on the fact they are source files. Pretend for example you took a million runs of a program with very regulated output. You want to prove all 1,000,000 versions of the output are the same.

Update: read the number of blocks rather than bytes? Immediately throw out those? Is that faster than finding the number of bytes?

Update: Is this ANY different than the fastest way to compare two files?

Solution

I'd opt for something like the approach taken by the cmp program: open two files (say file 1 and file 2), read a block from each, and compare them byte-by-byte. If they match, read the next block from each, compare them byte-by-byte, etc. If you get to the end of both files without detecting any differences, seek to the beginning of file 1, close file 2 and open file 3 in its place, and repeat until you've checked all files. I don't think there's any way to avoid reading all bytes of all files if they are in fact all identical, but I think this approach is (or is close to) the fastest way to detect any difference that may exist.
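
A minimal Python sketch of the cmp-style loop described above. The function name `first_mismatch` and the 64 KiB block size are assumptions for illustration, not anything cmp itself prescribes. It keeps the first file open as the reference, rewinds it for each new candidate, and stops at the first differing block.

```python
BLOCK_SIZE = 64 * 1024  # assumed block size; tune for your storage


def first_mismatch(paths, block_size=BLOCK_SIZE):
    """Return the first path that differs from paths[0], or None if all match."""
    if len(paths) < 2:
        return None
    with open(paths[0], "rb") as ref:
        for path in paths[1:]:
            ref.seek(0)  # rewind the reference file for each new candidate
            with open(path, "rb") as other:
                while True:
                    a = ref.read(block_size)
                    b = other.read(block_size)
                    if a != b:
                        return path  # differing content or differing length
                    if not a:
                        break  # both files ended together: this pair matches
    return None
```

Because the loop returns on the first differing block, a mismatch near the start of a file costs very little I/O; only the all-identical case forces every byte to be read.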

OP Modification: Lifted up important comment from Mark Bessey

"another obvious optimization if the files are expected to be mostly identical, and if they're relatively small, is to keep one of the files entirely in memory. That cuts way down on thrashing trying to read two files at once."

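The in-memory variant from that comment might look like the sketch below. The function name `all_identical` is hypothetical, and the size pre-check is an addition that the comment does not mention; it is included because files of different sizes can never match. It assumes the reference file fits comfortably in memory.

```python
import os


def all_identical(paths):
    """Return True if every file in paths has the same bytes as paths[0]."""
    if not paths:
        return True
    with open(paths[0], "rb") as f:
        reference = f.read()  # assumes the file fits comfortably in memory
    for path in paths[1:]:
        # Cheap pre-check (an addition): different sizes can never match.
        if os.path.getsize(path) != len(reference):
            return False
        with open(path, "rb") as f:
            if f.read() != reference:
                return False
    return True
```

Each candidate is read sequentially and exactly once, so the disk never has to alternate between two files.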