Is there a faster way (than this) to calculate the hash of a file (using hashlib) in Python?


Question

My current method is:

import hashlib

def get_hash(path=PATH, hash_type='md5'):
    # Look up the requested constructor by name (e.g. hashlib.md5) and create a hash object
    func = getattr(hashlib, hash_type)()
    with open(path, 'rb') as f:
        # Read in chunks of 1024 * block_size bytes until read() returns b'' (EOF)
        for block in iter(lambda: f.read(1024 * func.block_size), b''):
            func.update(block)
    return func.hexdigest()

It takes about 3.5 seconds to calculate the md5sum of an 842 MB ISO file on an i5 @ 1.7 GHz. I have tried different ways of reading the file, but all of them yield slower results. Is there, perhaps, a faster solution?
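For illustration, a minimal sketch of how such a timing can be taken (the file path below is a placeholder, not from the original question):

import time

start = time.time()
print(get_hash('/path/to/file.iso', 'md5'))      # placeholder path
print("{0:.3f} s".format(time.time() - start))   # wall-clock time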

EDIT: I replaced 2**16 (inside the f.read()) with 1024*func.block_size, since the default block_size for most hash functions supported by hashlib is 64 (except for 'sha384' and 'sha512', for which the default block_size is 128). Therefore, the chunk size is still the same (65536 bytes).
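These defaults are easy to verify (a small sketch; the values shown are the standard hashlib block sizes):

import hashlib

for name in ('md5', 'sha1', 'sha256', 'sha384', 'sha512'):
    print(name, getattr(hashlib, name)().block_size)
# md5, sha1 and sha256 report 64; sha384 and sha512 report 128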

EDIT(2): I did something wrong. It takes 8.4 seconds instead of 3.5. :(

EDIT(3): Apparently Windows was using the disk at +80% when I ran the function again. It really takes 3.5 seconds. Phew.

Another solution (roughly 0.5 seconds faster) is to use os.open():

import os
import hashlib

def get_hash(path=PATH, hash_type='md5'):
    func = getattr(hashlib, hash_type)()
    # os.O_BINARY only exists on Windows
    f = os.open(path, (os.O_RDWR | os.O_BINARY))
    # Read in chunks of 2048 * block_size bytes until os.read() returns b'' (EOF)
    for block in iter(lambda: os.read(f, 2048 * func.block_size), b''):
        func.update(block)
    os.close(f)
    return func.hexdigest()

Note that these results are not final.
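As an aside (not from the original post): os.O_BINARY is Windows-only, and opening read-only is enough for hashing, so a portable way to build the flags might look like this sketch:

import os

flags = os.O_RDONLY | getattr(os, 'O_BINARY', 0)   # O_BINARY falls back to 0 off Windows
fd = os.open('/path/to/file.iso', flags)            # placeholder path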

Answer

Using an 874 MiB random data file, which required 2 seconds with the openssl md5 tool, I was able to improve speed as follows.

  • Using the first method took 21 seconds.
  • Reading the entire file (21 seconds) into a buffer and then updating took 2 seconds.
  • Using the function below with a buffer size of 8096 took 17 seconds.
  • Using the function below with a buffer size of 32767 took 11 seconds.
  • Using the function below with a buffer size of 65536 took 8 seconds.
  • Using the function below with a buffer size of 131072 took 8 seconds.
  • Using the function below with a buffer size of 1048576 took 12 seconds.

import hashlib
import time

def md5_speedcheck(path, size):
    pts = time.process_time()
    ats = time.time()
    m = hashlib.md5()
    with open(path, 'rb') as f:
        # Read and hash the file in chunks of `size` bytes
        b = f.read(size)
        while len(b) > 0:
            m.update(b)
            b = f.read(size)
    print("{0:.3f} s".format(time.process_time() - pts))  # processor time
    print("{0:.3f} s".format(time.time() - ats))          # wall-clock time
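A small driver to reproduce the measurements above (a sketch; the path is a placeholder for your own test file):

for size in (8096, 32767, 65536, 131072, 1048576):
    print("buffer size:", size)
    md5_speedcheck('/path/to/file.iso', size)   # placeholder path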

The times noted above are human (wall-clock) time; processor time is about the same for all of these, with the difference spent blocking on I/O.

The key determinant here is to have a buffer size that is big enough to mitigate disk latency, but small enough to avoid VM page swaps. For my particular machine it appears that 64 KiB is about optimal.
