How to efficiently identify a binary file
Question
What's the most efficient way to identify a binary file? I would like to extract some kind of signature from a binary file and use it to compare the file against others.
The brute-force approach would be to use the whole file as a signature, but that would take too long and use too much memory. I'm looking for a smarter approach to this problem, and I'm willing to sacrifice a little accuracy (but not too much, ey) for performance.
(While Java code examples are preferred, language-agnostic answers are encouraged.)
Edit: Scanning the whole file to create a hash has the disadvantage that the bigger the file, the longer it takes. Since the hash wouldn't be unique anyway, I was wondering whether there is a more efficient approach (i.e., a hash computed from an evenly distributed sampling of bytes).
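For illustration, the sampling idea from the edit might look roughly like the sketch below. It is not taken from any answer: the class name `SampledHash`, the method `sampledSha1`, and the `samples` and `chunkSize` parameters are all hypothetical. The sketch feeds the file size plus small chunks read at evenly spaced offsets into SHA-1, so its cost depends on the number of samples and not on the file size.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: hashes only a sample of the file's bytes.
class SampledHash {

    // Hashes the file size plus chunks of chunkSize bytes taken at evenly
    // spaced offsets (about `samples` of them for large files).
    static byte[] sampledSha1(Path file, int samples, int chunkSize) throws IOException {
        if (samples <= 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("samples and chunkSize must be positive");
        }
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();

            // Mix in the size so files of different lengths rarely collide.
            ByteBuffer sizeBuf = ByteBuffer.allocate(Long.BYTES);
            sizeBuf.putLong(size);
            sizeBuf.flip();
            md.update(sizeBuf);

            // Small files end up being read completely, chunk by chunk.
            long stride = Math.max(chunkSize, size / samples);
            ByteBuffer buf = ByteBuffer.allocate(chunkSize);
            for (long pos = 0; pos < size; pos += stride) {
                buf.clear();
                // Positional reads may return fewer bytes than requested.
                while (buf.hasRemaining() && ch.read(buf, pos + buf.position()) > 0) {
                    // keep filling the chunk
                }
                buf.flip();
                md.update(buf);
            }
        }
        return md.digest();
    }
}
```

The accuracy trade-off is explicit here: two files of equal size that differ only in bytes outside the sampled chunks produce the same signature.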
Recommended answer
An approach I found effective for this sort of thing was to calculate two SHA-1 hashes: one for the first block of the file (I arbitrarily picked 512 bytes as the block size) and one for the whole file. I then stored the two hashes along with the file size. When I needed to identify a file, I would first compare the file length. If the lengths matched, I would compare the hash of the first block, and if that matched, I compared the hash of the entire file. The first two tests quickly weeded out a lot of non-matching files.
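Since the question asks for Java, here is a minimal sketch of the three-tier check described above. The answer gives no code, so the class name `FileSignature` and the methods `of` and `matches` are hypothetical, and `InputStream.readNBytes(int)` assumes Java 11 or later. The checks run from cheapest to most expensive, so the full-file hash is only computed for candidates that pass both earlier tests.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical holder for the stored triple: size, first-block hash, full hash.
class FileSignature {
    static final int BLOCK_SIZE = 512; // arbitrary, as in the answer

    final long size;
    final byte[] firstBlockHash;
    final byte[] fullHash;

    FileSignature(long size, byte[] firstBlockHash, byte[] fullHash) {
        this.size = size;
        this.firstBlockHash = firstBlockHash;
        this.fullHash = fullHash;
    }

    static MessageDigest sha1() {
        try {
            return MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // SHA-1 of the first BLOCK_SIZE bytes (or of the whole file if it is shorter).
    static byte[] hashFirstBlock(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sha1().digest(in.readNBytes(BLOCK_SIZE));
        }
    }

    // SHA-1 of the entire file, streamed so memory use stays constant.
    static byte[] hashWholeFile(Path file) throws IOException {
        MessageDigest md = sha1();
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return md.digest();
    }

    // Computes the signature to store for a known file.
    static FileSignature of(Path file) throws IOException {
        return new FileSignature(Files.size(file), hashFirstBlock(file), hashWholeFile(file));
    }

    // Compares a candidate against this signature, cheapest test first.
    boolean matches(Path candidate) throws IOException {
        if (Files.size(candidate) != size) {
            return false;
        }
        if (!Arrays.equals(hashFirstBlock(candidate), firstBlockHash)) {
            return false;
        }
        return Arrays.equals(hashWholeFile(candidate), fullHash);
    }
}
```

A stored signature would be created once with `FileSignature.of(path)` and later checked with `stored.matches(candidate)`.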