Is it possible to use the GPU to accelerate hashing in Python?


Question

I recently read Jeff's blog post entitled Speed Hashing, where amongst other things he mentions that you can hash things really fast by harnessing the power of your GPU.

I was wondering whether or not it was possible to harness the power of the GPU to hash things in Python (md5, sha-1 etc)?

I'm interested in this to see how fast I can brute-force things (nothing real-world, just hashes from old leaked data dumps).

At the moment, I'm doing this sort of thing (simplified example):

from itertools import product
from hashlib import md5

hashes = ["some", "hashes"]  # target hex digests to crack

chars = [chr(i) for i in range(97, 123)]  # a-z only

for i in range(1, 6):  # all combos of a-z, 1-5 chars
    for c in product(chars, repeat=i):
        s = ''.join(c)
        if md5(s.encode()).hexdigest() in hashes:
            print("Found", s)

But I was wondering whether there was a way to speed that up using the GPU? I'm guessing I'm going to need a module that would serially generate hashes like this - does anyone know of one?

Answer

There are two obstacles:

  1. Writing a program to execute on the GPU. AFAIK, there is no mechanism currently available to convert a Python program into code executed by a GPU. So unless you can find what you need (which might be possible, as it looks like a reasonably common use case), you are going to have to write that part in one of the GPU programming languages (CUDA, OpenCL, Haskell, ...).
  2. Calling the program running on the GPU from Python, and exchanging data. There are several Python+CUDA projects which do that part.

With suitable searching you might find more.

Python GPU programming also looks relevant.

Then the Python program will load and invoke the GPU 'kernel' (the program created using technology from part 1 of this answer) using one of the technologies from part 2, or equivalent.
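As a concrete illustration of part 2, here is roughly what loading and invoking a trivial kernel from Python looks like with PyCUDA (one of the Python+CUDA projects). This is a minimal sketch assuming PyCUDA and a CUDA-capable GPU are installed; the kernel just doubles numbers rather than computing MD5:

import numpy as np
import pycuda.autoinit              # creates a CUDA context on import
import pycuda.driver as cuda
from pycuda.compiler import SourceModule

# Compile a tiny CUDA kernel from source at runtime.
mod = SourceModule("""
__global__ void double_them(float *a)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    a[idx] *= 2.0f;
}
""")
double_them = mod.get_function("double_them")

a = np.random.randn(400).astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)    # allocate device memory
cuda.memcpy_htod(a_gpu, a)          # copy host -> device
double_them(a_gpu, block=(400, 1, 1), grid=(1, 1))
result = np.empty_like(a)
cuda.memcpy_dtoh(result, a_gpu)     # copy device -> host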

EDIT You might generate the entire set of 'brute force' values, and the md5 hashes, on the GPU. Then just retrieve the results using Python. This might be easier than generating the values in Python, passing them to the GPU, then getting the md5s back.
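One way to do that (a sketch of the idea, not code from the answer): have each GPU thread derive its own candidate string from its global thread index, so the candidates never cross the bus at all. The index-to-key mapping, shown here in plain Python:

def index_to_key(n, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Map integer n (0, 1, 2, ...) to 'a', 'b', ..., 'z', 'aa', 'ab', ...
    (bijective base 26), covering all short lowercase strings as n grows."""
    n += 1                          # shift so that n=0 maps to "a"
    key = []
    while n > 0:
        n, rem = divmod(n - 1, len(alphabet))
        key.append(alphabet[rem])
    return ''.join(reversed(key))

# Each GPU thread t would compute md5(index_to_key(t)) on-device.
assert index_to_key(0) == "a" and index_to_key(26) == "aa"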

If I've understood, the program generates all 1, 2, 3, 4, 5, and 6 lower-case-letter strings, and generates an md5 hash for each, yes?

Edit2 - My previous analysis was utterly wrong - I apologise

Edit3: Skimming the Wikipedia MD5 article, it looks like calculating the MD5 of a constant-length string (e.g. 6 ASCII characters) can be optimised.
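To see why a constant length helps, consider the padded input (an illustrative sketch, not from the answer): for a 6-byte key the whole MD5 message is a single 64-byte block whose shape never changes, so only the first 6 bytes vary between candidates:

import struct

def md5_block_for_6_byte_key(key: bytes) -> bytes:
    """Build the single padded 64-byte MD5 block for a 6-byte message."""
    assert len(key) == 6
    block = key + b"\x80" + b"\x00" * 49   # message, 1-bit marker, zero padding to 56 bytes
    block += struct.pack("<Q", 6 * 8)      # message length in bits, little-endian
    return block                           # exactly 64 bytes

assert len(md5_block_for_6_byte_key(b"abcdef")) == 64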

According to Wikipedia's pseudo-code it is only 64 loops, with groups of 16 loop iterations using the same arithmetic. So, if the key is under 55 bytes, the core of the calculation could be 'unrolled' from:

for i from 0 to 63
    if 0 ≤ i ≤ 15 then
        f := (b and c) or ((not b) and d)
        g := i
    else if 16 ≤ i ≤ 31
        f := (d and b) or ((not d) and c)
        g := (5×i + 1) mod 16
    else if 32 ≤ i ≤ 47
        f := b xor c xor d
        g := (3×i + 5) mod 16
    else if 48 ≤ i ≤ 63
        f := c xor (b or (not d))
        g := (7×i) mod 16
    temp := d
    d := c
    c := b
    b := b + leftrotate((a + f + k[i] + w[g]) , r[i])
    a := temp
end for

to:

// i == 0
f := (b and c) or ((not b) and d)   // +4 ops
// g := i
temp := d
d := c
c := b
b := b + leftrotate((a + f + k[0] + w[0]) , r[0])  // 9 ops
a := temp
// i == 1
f := (b and c) or ((not b) and d)
// g := i
temp := d
d := c
c := b
b := b + leftrotate((a + f + k[1] + w[1]) , r[1])
a := temp

That unrolling causes some of the array indexing to become constant, which should allow a good GPU compiler to do even more constant propagation. This might give a significant improvement. Each step is approximately 9 operations, and the compiler will need to shuffle 5 pieces of data, so roughly 14 operations/step * 64 steps = approximately 1000 operations.
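For reference, the leftrotate in the pseudo-code is a plain 32-bit rotation; in Python it could be written as:

def leftrotate(x, c):
    """Rotate the 32-bit value x left by c bits."""
    x &= 0xFFFFFFFF
    return ((x << c) | (x >> (32 - c))) & 0xFFFFFFFF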

Edit4:
Glerk! I've read more of the Wikipedia MD5 algorithm - MD5 is easier to attack than I'd realised. Only the first two loops of each group of 16 directly use the variable 6-byte key string; the rest of the input block is constant. The remainder of the algorithm is shuffling and bit-wise operations, which are likely amenable to very significant further optimisation. With only 2 of each 16 loops involving the key, the attack might be up to 8x faster, and probably more than 4x.

So rather than a 1024-core GPU running at 1GHz giving 1024 hashes/microsecond, it might instead manage say 4096-8192 hashes/microsecond = 4-8 hashes/nanosecond.

There are approximately 27^6 = 387,420,489 keys (treating each of the 6 positions as one of 26 letters or empty), and hence that many md5 hashes.

387,420,489 keys / 4-8 hashes/nanosecond ≈ 0.05-0.1 seconds
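A quick Python sanity-check of those numbers (the 4-8 hashes/nanosecond rate is the estimate above, not a measurement):

keys = 27 ** 6                              # 387,420,489 (the answer's estimate)
exact = sum(26 ** i for i in range(1, 7))   # 321,272,406 exact 1-6 letter strings
for rate in (4e9, 8e9):                     # 4-8 hashes/nanosecond
    print(keys / rate)                      # ~0.097s and ~0.048s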

Communication between host and GPU will be quite slow, but is unlikely to add more than 100% overhead.

So approximately between 0.1 seconds and 0.2 seconds.

An md5 hash is 16 bytes, so storing all of them would consume 6.2 GBytes. On two modern GPUs that would only require 2 transfers each, but it would be a very significant overhead. If the hash values are saved to disk (even to an SSD), or moved over 10Gbit Ethernet, the hash generation is swamped by the I/O time.

There are only 94 printable ASCII characters, so for 6-character keys drawn from the full printable ASCII set:

94^6 = 689,869,781,056 keys / 4-8 hashes/nanosecond ≈ 86-172 seconds

Oh my! :-(

Longer keys, and something better than MD5!

Maybe try to write a Python program to generate the optimal GPU algorithm?

Generate the text of the GPU 'kernel' by 'unrolling' the loops within the Python program, and print the text of the straight-line calculation, with all the constants filled in.
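A minimal sketch of such a generator (illustrative only): it emits C-like text for the round structure quoted above, with the g index baked in as a constant and the round constants left as k[] and r[] lookups whose indices are now compile-time constants:

# Emit the 64 unrolled MD5 rounds as straight-line text.
lines = []
for i in range(64):
    if i < 16:
        f, g = "(b & c) | (~b & d)", i
    elif i < 32:
        f, g = "(d & b) | (~d & c)", (5 * i + 1) % 16
    elif i < 48:
        f, g = "b ^ c ^ d", (3 * i + 5) % 16
    else:
        f, g = "c ^ (b | ~d)", (7 * i) % 16
    lines.append(f"// i == {i}")
    lines.append(f"f = {f};")
    lines.append("temp = d; d = c; c = b;")
    lines.append(f"b = b + leftrotate(a + f + k[{i}] + w[{g}], r[{i}]);")
    lines.append("a = temp;")
print("\n".join(lines))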

Then try to figure out what the optimal sequence of instructions is to calculate the MD5 for each key-length. Using the unrolled program, try to track the operations on each bit, and the dependencies, then try to reassemble the bits and their operations into contiguous 32-bit words and a new straight-line calculation. (To be fair, maybe the GPU compiler can do some of this anyway? Might be interesting to find out.)
