md5 and large files

I have some large files (between 2 & 4 GB) that I want to do a few
things with. Here's how I've been using the md5 module in Python:

import md5

original = file(path + f, 'rb')   # path and f together name an existing file
data = original.read(4096)        # read only the first 4096 bytes
original.close()
verify = md5.new(data)
print verify.hexdigest(), f

Is reading the first 4096 bytes of the files and calculating the md5 sum
based on that sufficient for uniquely identifying the files or am I
going about this totally wrong? Any advice or ideas appreciated.

Solution

It seems likely that 2 files would have the same 4k "preamble".

For instance, a unix tar file containing a 16k "file1" and then a 1k
"file2" would have the same leading bytes as a unix tar file containing
a 16k "file1" and a 1k "file3", and therefore the md5sum over the first
4k would match. (these two tar files would also have the same byte
length)
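This is straightforward to reproduce with the standard tarfile module. The
sketch below (archive and member names are invented for illustration) builds
the two archives described above and compares the md5 of their first 4096
bytes:

import md5
import tarfile

# Create the members once so both archives share a byte-identical "file1"
# entry (same content, same metadata, hence same tar header).
for name, size in [('file1', 16384), ('file2', 1024), ('file3', 1024)]:
    member = open(name, 'wb')
    member.write('x' * size)
    member.close()

for archive, second in [('a.tar', 'file2'), ('b.tar', 'file3')]:
    tar = tarfile.open(archive, 'w')
    tar.add('file1')          # identical first member in both archives
    tar.add(second)
    tar.close()

def md5_first_4k(path):
    f = open(path, 'rb')
    digest = md5.new(f.read(4096)).hexdigest()
    f.close()
    return digest

print md5_first_4k('a.tar') == md5_first_4k('b.tar')   # prints True

Both archives start with file1's 512-byte tar header followed by its 16k of
data, so their contents do not diverge until byte 16896, well past the 4096
bytes being hashed.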

If all pages on some website begin
<HTML>
<HEAD>
<SCRIPT> pages and pages of javascript here (at least 4k) </SCRIPT>
<TITLE> ...
the initial 4k might match, too.

But anyway, if s1 != s2, then the odds that hash(s1) == hash(s2) should
be small (for a 128-bit digest like MD5, roughly 1 in 2**128 for
arbitrary inputs), and that shouldn't depend on the length of the string.

Jeff



Brad Tilley wrote:

Is reading the first 4096 bytes of the files and calculating the md5 sum
based on that sufficient for uniquely identifying the files or am I
going about this totally wrong? Any advice or ideas appreciated.



Clearly, you need to use the same procedure for later verification. The
usual approach is to compute the md5sum for the entire file.

Whether this is sufficient somewhat depends on what you want to achieve:
- uniquely identify the file: this works reliably if there is some
guarantee that no two such files will be identical within the first
4096 bytes. If your files are, say, log files with different starting
dates, and the log file lines contain the starting dates, this is a
safe assumption. If these are different versions of essentially the
same file (e.g. different compilations of the same source code), I
would not bet that different files already differ within the first
4096 bytes.

- verify that the file is not corrupted, tampered with, or modified.
Your approach is clearly insufficient, as it can only detect
modifications within the first 4096 bytes (see the sketch below).
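A minimal sketch of that limitation, using in-memory strings as stand-ins
for file contents (the data is contrived for illustration): two payloads
that agree in their first 4096 bytes but differ afterwards yield identical
prefix digests yet different whole-payload digests.

import md5

s1 = 'x' * 8192
s2 = 'x' * 8191 + 'y'   # differs from s1 only in its final byte

# Digests over the 4096-byte prefix collide even though the payloads differ:
print md5.new(s1[:4096]).hexdigest() == md5.new(s2[:4096]).hexdigest()  # True
# Digests over the whole payload expose the difference:
print md5.new(s1).hexdigest() == md5.new(s2).hexdigest()                # False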

Regards,
Martin


Martin v. Löwis wrote:

[Martin's reply, quoted in full above, snipped]



I would like to verify that the files are not corrupt, so what's the most
efficient way to calculate md5 sums on 4GB files? The machine doing the
calculations is a small desktop with 256MB of RAM.
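The standard way to bound memory use is to stream the file through a single
md5 object in fixed-size chunks, so that only one chunk is ever held in
memory regardless of the file's size. A minimal sketch (the helper name and
the 1MB chunk size are arbitrary choices):

import md5

def md5_file(path, chunk_size=1024 * 1024):
    # Feed the file to one md5 object a chunk at a time; memory use
    # stays at about chunk_size bytes even for multi-gigabyte files.
    digest = md5.new()
    f = open(path, 'rb')
    while True:
        chunk = f.read(chunk_size)
        if not chunk:          # empty string signals end of file
            break
        digest.update(chunk)
    f.close()
    return digest.hexdigest()

print md5_file('some_large_file.iso')   # hypothetical 4GB file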

