What is _md5.md5 and why is hashlib.md5 so much slower?


Problem description

I found this undocumented _md5 module after getting frustrated with the slow stdlib hashlib.md5 implementation.

On a macbook:

>>> timeit hashlib.md5(b"hello world")
597 ns ± 17.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"hello world")
224 ns ± 3.18 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> _md5
<module '_md5' from '/usr/local/Cellar/python/3.7.6_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload/_md5.cpython-37m-darwin.so'>

On a Windows box:

>>> timeit hashlib.md5(b"stonk overflow")
328 ns ± 21.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"stonk overflow")
110 ns ± 12.5 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> _md5
<module '_md5' (built-in)>

On a Linux box:

>>> timeit hashlib.md5(b"https://adventofcode.com/2016/day/5")
259 ns ± 1.33 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>>> timeit _md5.md5(b"https://adventofcode.com/2016/day/5")
102 ns ± 0.0576 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
>>> _md5
<module '_md5' from '/usr/local/lib/python3.8/lib-dynload/_md5.cpython-38-x86_64-linux-gnu.so'>

For hashing short messages, it's way faster. For long messages, similar performance.

Why is it hidden away in an underscore extension module, and why isn't this faster implementation used by default in hashlib? What is the _md5 module and why doesn't it have a public API?

Solution

Until Python 2.5, the hashes and digests were implemented in their own modules (e.g. [Python 2.Docs]: md5 - MD5 message digest algorithm).
Starting with v2.5, [Python 2.6.Docs]: hashlib - Secure hashes and message digests was added. Its purpose was to:

  1. Offer a unified access method to the hashes / digests (via their name)
  2. Switch (by default) to an external cryptography provider (delegating to an entity specialized in that field seems the logical step, as maintaining all those algorithms could be overkill). At that time OpenSSL was the best choice: mature enough, well known and compatible (there were a bunch of similar Java providers, but those were pretty useless)

As a side effect of #2, Python's own implementations were hidden from the public API (they were renamed to _md5, _sha1, _sha256, _sha512, and the ones added later are _blake2 and _sha3), as redundancy often creates confusion.
Another side effect was that _hashlib.so depends on OpenSSL's libcrypto*.so (this is Nix (at least Lnx) specific; on Win, a static libeay32.lib was linked into _hashlib.pyd and also _ssl.pyd (which I consider lame) until v3.7+, where the OpenSSL .dlls are part of the Python installation).
On probably 90+% of machines things went smoothly, as OpenSSL was / is installed by default. Where it isn't, many things could break, because hashlib is imported by many modules (one example is random, which itself gets imported by lots of others), so trivial pieces of code that are not related to cryptography at all (at least not at first sight) would stop working. That's why the old implementations are kept (but again, they are only fallbacks, as the OpenSSL versions are / should be better maintained).

[cfati@cfati-ubtu16x64-0:~/Work/Dev/StackOverflow/q059955854]> ~/sopr.sh
*** Set shorter prompt to better fit when pasted in StackOverflow (or other) pages ***

[064bit-prompt]> python3 -c "import sys, hashlib as hl, _md5, ssl;print(\"{0:}\n{1:}\n{2:}\n{3:}\".format(sys.version, _md5, hl._hashlib, ssl.OPENSSL_VERSION))"
3.5.2 (default, Oct  8 2019, 13:06:37)
[GCC 5.4.0 20160609]
<module '_md5' (built-in)>
<module '_hashlib' from '/usr/lib/python3.5/lib-dynload/_hashlib.cpython-35m-x86_64-linux-gnu.so'>
OpenSSL 1.0.2g  1 Mar 2016
[064bit-prompt]>
[064bit-prompt]> ldd /usr/lib/python3.5/lib-dynload/_hashlib.cpython-35m-x86_64-linux-gnu.so
        linux-vdso.so.1 =>  (0x00007fffa7d0b000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f50d9e4d000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f50d9a83000)
        libcrypto.so.1.0.0 => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007f50d963e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f50da271000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f50d943a000)
[064bit-prompt]>
[064bit-prompt]> openssl version -a
OpenSSL 1.0.2g  1 Mar 2016
built on: reproducible build, date unspecified
platform: debian-amd64
options:  bn(64,64) rc4(16x,int) des(idx,cisc,16,int) blowfish(idx)
compiler: cc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -m64 -DL_ENDIAN -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DMD32_REG_T=int -DOPENSSL_IA32_SSE2 -DOPENSSL_BN_ASM_MONT -DOPENSSL_BN_ASM_MONT5 -DOPENSSL_BN_ASM_GF2m -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM -DMD5_ASM -DAES_ASM -DVPAES_ASM -DBSAES_ASM -DWHIRLPOOL_ASM -DGHASH_ASM -DECP_NISTZ256_ASM
OPENSSLDIR: "/usr/lib/ssl"
[064bit-prompt]>
[064bit-prompt]> python3 -c "import _md5, hashlib as hl;print(_md5.md5(b\"A\").hexdigest(), hl.md5(b\"A\").hexdigest())"
7fc56270e7a70fa81a5935b72eacbe29 7fc56270e7a70fa81a5935b72eacbe29
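
The same check can be scripted. The following is a minimal sketch, assuming a CPython build; `describe_md5_backends` and `digests_agree` are hypothetical helper names, and both `_md5` and `_hashlib` are private modules that may be missing from a given build, so the imports are guarded.

```python
import hashlib

try:
    import _md5  # private CPython module; not guaranteed to exist
except ImportError:
    _md5 = None

try:
    import _hashlib  # OpenSSL wrapper; absent if Python was built without OpenSSL
except ImportError:
    _hashlib = None


def describe_md5_backends():
    """Report which constructor hashlib.md5 resolves to and which modules exist."""
    return {
        # Typically "openssl_md5" when OpenSSL backs hashlib, as in the
        # benchmark output further down in this answer.
        "hashlib.md5": hashlib.md5.__name__,
        "has__md5": _md5 is not None,
        "has__hashlib": _hashlib is not None,
    }


def digests_agree(data=b"A"):
    """True if the private builtin and hashlib yield the same digest.

    Also returns True when _md5 is unavailable (nothing to compare).
    """
    if _md5 is None:
        return True
    return _md5.md5(data).hexdigest() == hashlib.md5(data).hexdigest()


if __name__ == "__main__":
    print(describe_md5_backends())
    print("digests agree:", digests_agree())
```

Both constructors should return the same digest for the same input; only the code path that produces it differs.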

According to [Python 3.Docs]: hashlib.algorithms_guaranteed:

A set containing the names of the hash algorithms guaranteed to be supported by this module on all platforms. Note that ‘md5’ is in this list despite some upstream vendors offering an odd "FIPS compliant" Python build that excludes it.

Below is an example from a custom Python 2.7 installation (which I built quite a while ago; worth mentioning that it dynamically links to the OpenSSL .dlls):

e:\Work\Dev\StackOverflow\q059955854>sopr.bat
*** Set shorter prompt to better fit when pasted in StackOverflow (or other) pages ***

[prompt]> "F:\Install\pc064\HPE\OPSWpython\2.7.10__00\python.exe" -c "import sys, ssl;print(\"{0:}\n{1:}\".format(sys.version, ssl.OPENSSL_VERSION))"
2.7.10 (default, Mar  8 2016, 15:02:46) [MSC v.1600 64 bit (AMD64)]
OpenSSL 1.0.2j-fips  26 Sep 2016

[prompt]> "F:\Install\pc064\HPE\OPSWpython\2.7.10__00\python.exe" -c "import hashlib as hl;print(hl.md5(\"A\").hexdigest())"
7fc56270e7a70fa81a5935b72eacbe29

[prompt]> "F:\Install\pc064\HPE\OPSWpython\2.7.10__00\python.exe" -c "import ssl;ssl.FIPS_mode_set(True);import hashlib as hl;print(hl.md5(\"A\").hexdigest())"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: error:060A80A3:digital envelope routines:FIPS_DIGESTINIT:disabled for fips
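
On recent Python 3 versions, code that only needs MD5 as a non-security checksum can guard against this failure. A minimal sketch, assuming Python 3.9+ (where `hashlib` constructors accept the keyword-only `usedforsecurity` argument); `md5_hexdigest` is a hypothetical helper name, and whether the retry succeeds depends on how the restricted build is configured.

```python
import hashlib


def md5_hexdigest(data):
    """Return the MD5 hex digest, tolerating FIPS-restricted OpenSSL builds."""
    try:
        return hashlib.md5(data).hexdigest()
    except ValueError:
        # Raised when OpenSSL refuses MD5, as in the traceback above.
        # usedforsecurity was added in Python 3.9; on older versions
        # this call raises TypeError instead.
        return hashlib.md5(data, usedforsecurity=False).hexdigest()


if __name__ == "__main__":
    print(md5_hexdigest(b"A"))
```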

As for the speed question I can only speculate:

  • Python implementation was (obviously) written specifically for Python, meaning it is "more optimized" (yes, this is grammatically incorrect) for Python than a generic version, and also resides in python*.so (or the python executable itself)
  • OpenSSL implementation resides in libcrypto*.so, and it's being accessed by the wrapper _hashlib.so, which does the back and forth conversions between Python types (PyObject*) and the OpenSSL ones (EVP_MD_CTX*)

Considering the above, it would make sense that the former is (slightly) faster (at least for small messages, where the overhead (function call and other Python underlying operations) takes a significant percentage of the total time compared to the hashing itself). There are also other factors to be considered (e.g. whether OpenSSL assembler speedups were used).
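
One way to probe this speculation is to time both constructors on an empty message, where almost all of the cost is per-call overhead, and on a larger one, where hashing dominates. This is a rough sketch, not a rigorous benchmark: `per_call_ns` and `compare` are hypothetical helpers, the `lambda` adds the same small overhead to both candidates, and `_md5` is a private module that may not exist in every build.

```python
import hashlib
import timeit

try:
    import _md5  # private CPython module; may be missing in some builds
    builtin_md5 = _md5.md5
except ImportError:
    builtin_md5 = None


def per_call_ns(func, payload, number=100_000, repeat=5):
    """Best-of-`repeat` time of a single func(payload) call, in nanoseconds."""
    timer = timeit.Timer(lambda: func(payload))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


def compare(sizes=(0, 16, 4096), number=100_000, repeat=5):
    """Time hashlib.md5 (and _md5.md5 if present) for each message size."""
    rows = []
    for size in sizes:
        payload = b"A" * size
        row = {"size": size,
               "hashlib": per_call_ns(hashlib.md5, payload, number, repeat)}
        if builtin_md5 is not None:
            row["_md5"] = per_call_ns(builtin_md5, payload, number, repeat)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in compare():
        print(row)
```

If the wrapper overhead explanation holds, the gap between the two columns should stay roughly constant in absolute terms for the small sizes and matter less and less as the message grows.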



Update #0

Below are some benchmarks of my own.

code00.py:

#!/usr/bin/env python

import sys
from hashlib import md5 as md5_openssl
from _md5 import md5 as md5_builtin
import timeit


def main(*argv):
    base_text = b"A"
    number = 1000000
    print("timeit attempts number: {0:d}".format(number))
    #x = []
    #y = {}
    for count in range(0, 16):
        factor = 2 ** count
        text = base_text * factor
        globals_dict = {"text": text}
        #x.append(factor)
        print("\nUsing a {0:8d} (2 ** {1:2d}) bytes message".format(len(text), count))
        for func in [
            md5_openssl,
            md5_builtin,
        ]:
            globals_dict["md5"] = func

            t = timeit.timeit(stmt="md5(text)", globals=globals_dict, number=number)
            print("    {0:12s} took: {1:11.6f} seconds".format(func.__name__, t))
            #y.setdefault(func.__name__, []).append(t)
    #print(x, y)


if __name__ == "__main__":
    print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(item.strip() for item in sys.version.split("\n")), 64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    main(*sys.argv[1:])
    print("\nDone.")

Output:

  • Win 10 pc064 (running on a Dell Precision 5510 laptop):

    [prompt]> "e:\Work\Dev\VEnvs\py_pc064_03.07.06_test0\Scripts\python.exe" code00.py
    Python 3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)] 64bit on win32
    
    timeit attempts number: 1000000
    
    Using a        1 (2 **  0) bytes message
        openssl_md5  took:    0.449134 seconds
        md5          took:    0.120021 seconds
    
    Using a        2 (2 **  1) bytes message
        openssl_md5  took:    0.460399 seconds
        md5          took:    0.118555 seconds
    
    Using a        4 (2 **  2) bytes message
        openssl_md5  took:    0.451850 seconds
        md5          took:    0.121166 seconds
    
    Using a        8 (2 **  3) bytes message
        openssl_md5  took:    0.438398 seconds
        md5          took:    0.118127 seconds
    
    Using a       16 (2 **  4) bytes message
        openssl_md5  took:    0.454653 seconds
        md5          took:    0.122818 seconds
    
    Using a       32 (2 **  5) bytes message
        openssl_md5  took:    0.450776 seconds
        md5          took:    0.118594 seconds
    
    Using a       64 (2 **  6) bytes message
        openssl_md5  took:    0.555761 seconds
        md5          took:    0.278812 seconds
    
    Using a      128 (2 **  7) bytes message
        openssl_md5  took:    0.681296 seconds
        md5          took:    0.455921 seconds
    
    Using a      256 (2 **  8) bytes message
        openssl_md5  took:    0.895952 seconds
        md5          took:    0.807457 seconds
    
    Using a      512 (2 **  9) bytes message
        openssl_md5  took:    1.401584 seconds
        md5          took:    1.499279 seconds
    
    Using a     1024 (2 ** 10) bytes message
        openssl_md5  took:    2.360966 seconds
        md5          took:    2.878650 seconds
    
    Using a     2048 (2 ** 11) bytes message
        openssl_md5  took:    4.383245 seconds
        md5          took:    5.655477 seconds
    
    Using a     4096 (2 ** 12) bytes message
        openssl_md5  took:    8.264774 seconds
        md5          took:   10.920909 seconds
    
    Using a     8192 (2 ** 13) bytes message
        openssl_md5  took:   15.521947 seconds
        md5          took:   21.895179 seconds
    
    Using a    16384 (2 ** 14) bytes message
        openssl_md5  took:   29.947287 seconds
        md5          took:   43.198639 seconds
    
    Using a    32768 (2 ** 15) bytes message
        openssl_md5  took:   59.123447 seconds
        md5          took:   86.453821 seconds
    
    Done.
    

  • Ubtu 16 pc064 (VM running in VirtualBox on the above machine):

    [064bit-prompt]> python3 code00.py
    Python 3.5.2 (default, Oct  8 2019, 13:06:37) [GCC 5.4.0 20160609] 64bit on linux
    
    timeit attempts number: 1000000
    
    Using a        1 (2 **  0) bytes message
        openssl_md5  took:    0.246166 seconds
        md5          took:    0.130589 seconds
    
    Using a        2 (2 **  1) bytes message
        openssl_md5  took:    0.251019 seconds
        md5          took:    0.127750 seconds
    
    Using a        4 (2 **  2) bytes message
        openssl_md5  took:    0.257018 seconds
        md5          took:    0.123116 seconds
    
    Using a        8 (2 **  3) bytes message
        openssl_md5  took:    0.245399 seconds
        md5          took:    0.128267 seconds
    
    Using a       16 (2 **  4) bytes message
        openssl_md5  took:    0.251832 seconds
        md5          took:    0.136373 seconds
    
    Using a       32 (2 **  5) bytes message
        openssl_md5  took:    0.248410 seconds
        md5          took:    0.140708 seconds
    
    Using a       64 (2 **  6) bytes message
        openssl_md5  took:    0.361016 seconds
        md5          took:    0.267021 seconds
    
    Using a      128 (2 **  7) bytes message
        openssl_md5  took:    0.478735 seconds
        md5          took:    0.413986 seconds
    
    Using a      256 (2 **  8) bytes message
        openssl_md5  took:    0.707602 seconds
        md5          took:    0.695042 seconds
    
    Using a      512 (2 **  9) bytes message
        openssl_md5  took:    1.216832 seconds
        md5          took:    1.268570 seconds
    
    Using a     1024 (2 ** 10) bytes message
        openssl_md5  took:    2.122014 seconds
        md5          took:    2.429623 seconds
    
    Using a     2048 (2 ** 11) bytes message
        openssl_md5  took:    4.158188 seconds
        md5          took:    4.847686 seconds
    
    Using a     4096 (2 ** 12) bytes message
        openssl_md5  took:    7.839173 seconds
        md5          took:    9.242224 seconds
    
    Using a     8192 (2 ** 13) bytes message
        openssl_md5  took:   15.282232 seconds
        md5          took:   18.368874 seconds
    
    Using a    16384 (2 ** 14) bytes message
        openssl_md5  took:   30.681912 seconds
        md5          took:   36.755073 seconds
    
    Using a    32768 (2 ** 15) bytes message
        openssl_md5  took:   60.230543 seconds
        md5          took:   73.237356 seconds
    
    Done.
    

The results seem to be quite different from yours. In my case:

  • Starting somewhere between 256 B and 512 B, the OpenSSL implementation performs better than the builtin one (on both machines the builtin is still ahead at 256 B and behind at 512 B)
  • I know that there are too few results to claim a pattern, but both implementations appear to scale linearly (in time) with message size, with the builtin slope being a bit steeper, meaning it will perform worse in the long run
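
The linear trend can be checked against the numbers above. The sketch below fits a least-squares line to the Win 10 timings (copied from the output, in seconds per 1,000,000 calls) and solves for the message size at which the two lines cross. `fit_line` and `crossover` are hypothetical helpers, and a straight line is only an approximation, since the timings are flat below 64 bytes.

```python
# Timings copied from the Win 10 run above: seconds per 1,000,000 calls
# for message sizes 2**0 .. 2**15 bytes.
SIZES = [2 ** k for k in range(16)]
OPENSSL = [0.449134, 0.460399, 0.451850, 0.438398, 0.454653, 0.450776,
           0.555761, 0.681296, 0.895952, 1.401584, 2.360966, 4.383245,
           8.264774, 15.521947, 29.947287, 59.123447]
BUILTIN = [0.120021, 0.118555, 0.121166, 0.118127, 0.122818, 0.118594,
           0.278812, 0.455921, 0.807457, 1.499279, 2.878650, 5.655477,
           10.920909, 21.895179, 43.198639, 86.453821]


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossover(line_a, line_b):
    """Message size at which two (slope, intercept) lines intersect."""
    (slope_a, icpt_a), (slope_b, icpt_b) = line_a, line_b
    return (icpt_a - icpt_b) / (slope_b - slope_a)


if __name__ == "__main__":
    openssl_line = fit_line(SIZES, OPENSSL)
    builtin_line = fit_line(SIZES, BUILTIN)
    # Seconds per 1e6 calls equals microseconds per call, so multiplying
    # by 1000 gives nanoseconds per byte (slope) or per call (intercept).
    for name, (slope, icpt) in (("openssl", openssl_line),
                                ("builtin", builtin_line)):
        print("{0:8s} {1:.3f} ns/byte, {2:.0f} ns fixed cost".format(
            name, slope * 1000, icpt * 1000))
    print("fitted crossover: {0:.0f} bytes".format(
        crossover(openssl_line, builtin_line)))
```

On this data the builtin line has the smaller intercept and the steeper slope, and the fitted crossover lands in the few-hundred-byte range, in line with the raw measurements.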

In conclusion, if all your messages are small and the builtin implementation works best for you, then use it.



Update #1

Graphical representation (I had to reduce the number of timeit iterations by an order of magnitude, as it would take much too long for large messages):

and zooming in on the area where the 2 graphs intersect:
