算法来确定一个文件的身份 [英] Algorithm for determining a file's identity

查看:146
本文介绍了算法来确定一个文件的身份的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

对于一个开源项目,我有我的文件系统上面写了一个抽象层。

For an open source project I have I am writing an abstraction layer on top of the filesystem.

这层允许我连接的元数据和关系到每个文件。

This layer allows me to attach metadata and relationships to each file.

我想处理文件的图层命名已摆好,并保存元数据,如果一个文件被重命名/移动或复制。

I would like the layer to handle file renames gracefully and maintain the metadata if a file is renamed / moved or copied.

要做到这一点,我将需要一种机制,用于计算一个文件的标识。显而易见的解决方案是计算的SHA1哈希为每个文件,然后针对该散列分配的元数据。但是......那是真贵,尤其是对于电影。

To do this I will need a mechanism for calculating the identity of a file. The obvious solution is to calculate an SHA1 hash for each file and then assign metadata against that hash. But ... that is really expensive, especially for movies.

所以,我一直在想一个算法,尽管不是100%正确的将是正确的大多数时间,而且是廉价的。

So, I have been thinking of an algorithm that though not 100% correct will be right the vast majority of the time, and is cheap.

一个这样的算法可以是使用文件大小和字节的样本该文件以计算哈希值。

One such algorithm could be to use file size and a sample of bytes for that file to calculate the hash.

的字节我应该选择的样本?如何保持计算价格便宜,比较准确?我知道有一个权衡在这里,但性能是至关重要的。且用户将能够处理的情况下,系统会犯错误。

Which bytes should I choose for the sample? How do I keep the calculation cheap and reasonably accurate? I understand there is a tradeoff here, but performance is critical. And the user will be able to handle situations where the system makes mistakes.

我需要这个算法工作的非常大的文件(1GB +和小文件5K)

I need this algorithm to work for very large files (1GB+ and tiny files 5K)

修改

我需要这个算法工作在NTFS和所有SMB共享(Linux或Windows为基础),我想它来支持,其中一个文件从一个位置复制到另一个情况(2物理副本存在被视为一个身份) 。我甚至可以考虑希望在此情况下,MP3文件被重新标记工作(物理文件被更改,所以我可以在一个文件类型的身份提供)。

I need this algorithm to work on NTFS and all SMB shares (linux or windows based), I would like it to support situations where a file is copied from one spot to another (2 physical copies exist are treated as one identity). I may even consider wanting this to work in situations where MP3s are re-tagged (the physical file is changed, so I may have an identity provider per filetype).

编辑2

相关问题:<一href="http://stackoverflow.com/questions/788761/algorithm-for-determining-a-files-identity-optimisation">Algorithm为确定文件的身份(优化)

推荐答案

如何保存一些随机整数r <子>我,并查找字节(R <子>我模N)其中n是文件的大小?对于头文件,你可以先忽略他们,然后执行此过程中的剩余字节。

How about storing some random integers ri, and looking up bytes (ri mod n) where n is the size of file? For files with headers, you can ignore them first and then do this process on the remaining bytes.

如果你的文件实际上是pretty的不同(不只是在一个单字节差的地方,而是说,至少有1%的不同),然后随机选择的字节会注意到这一点。例如,在字节1%的差异,100随机字节将失败的概率是1 / E〜37%,以通知;越来越多的字节,你看数量使得这种概率下降成倍增长。

If your files are actually pretty different (not just a difference in a single byte somewhere, but say at least 1% different), then a random selection of bytes would notice that. For example, with a 1% difference in bytes, 100 random bytes would fail to notice with probability 1/e ~ 37%; increasing the number of bytes you look at makes this probability go down exponentially.

使用随机字节背后的想法是,他们基本上保证(当然,概率来说)是不如字节的任何其它序列,但他们的没有的容易出现的一些问题与其它序列(例如,发生在看其中该字节需要为0或东西的文件格式的每256个字节)。

The idea behind using random bytes is that they are essentially guaranteed (well, probabilistically speaking) to be as good as any other sequence of bytes, except they aren't susceptible to some of the problems with other sequences (e.g. happening to look at every 256-th byte of a file format where that byte is required to be 0 or something).

一些更多的建议:

  • 而不是抓字节,抢到更大的块来证明求的成本。
  • 我会建议一直在寻找的第一个块左右的文件。由此,可以判断文件类型和等。 (例如,你可以使用文件程序。)
  • 在至少权衡像整个文件的CRC的成本/效益。它并不像贵作为一个真正的加密散列函数,但仍需要读取整个文件。有利的一面是它的将会的通知单字节的差异。
  • Instead of grabbing bytes, grab larger chunks to justify the cost of seeking.
  • I would suggest always looking at the first block or so of the file. From this, you can determine filetype and such. (For example, you could use the file program.)
  • At least weigh the cost/benefit of something like a CRC of the entire file. It's not as expensive as a real cryptographic hash function, but still requires reading the entire file. The upside is it will notice single-byte differences.

这篇关于算法来确定一个文件的身份的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆