Python:重写循环的numpy数学函数以在GPU上运行 [英] Python: rewrite a looping numpy math function to run on GPU

查看：260 发布时间：2020/5/18 20:33:59 python numpy tensorflow theano numba

本文介绍了Python:重写循环的numpy数学函数以在GPU上运行的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

有人可以帮我重写这个函数(doTheMath函数)在GPU上进行计算吗?我现在花了好几天试图设法解决这个问题，但是没有结果.我想知道也许有人会以您认为适合日志的任何方式帮助我重写此函数，因为最后我给出的结果相同.我尝试从numba使用@jit，但是由于某些原因，它实际上比正常运行代码要慢得多.样本量巨大，目标是大幅减少执行时间，因此自然而然地相信GPU是最快的执行方法.

Can someone help me rewrite this one function (the doTheMath function) to do the calculations on the GPU? I used a few good days now trying to get my head around it but to no result. I wonder maybe somebody can help me rewrite this function in whatever way you may seem fit as log as I gives the same result at the end. I tried to use @jit from numba but for some reason it is actually much slower than running the code as usual. With a huge sample size, the goal is to decrease the execution time considerably so naturally I believe the GPU is the fastest way to do it.

我会解释一些实际发生的情况.实际数据看上去与在下面的代码中创建的样本数据几乎相同，被划分为每个样本大约5.000.000行或每个文件大约150MB的样本大小.总共大约有600.000.000行或20GB的数据.我必须遍历此数据，逐个样本，然后在每个样本中逐行，取每行的最后2000行(或另一行)，并运行返回结果的doTheMath函数.然后将结果保存到hardrive中，在那里我可以使用另一个程序来做其他事情.正如您在下面看到的，我不需要所有行的所有结果，只需要那些大于特定数量的行.如果我现在在python中运行函数，则每1.000.000行大约得到62秒.考虑所有数据及其处理速度，这是一个很长的时间.

I'll explain a little what is actually happening. The real data, which looks almost identical as the sample data created in the code below is divided into sample sizes of approx 5.000.000 rows each sample or around 150MB per file. In total there are around 600.000.000 rows or 20GB of data. I must loop through this data, sample by sample and then row by row in each sample, take the last 2000 (or another) rows as of each line and run the doTheMath function which returns a result. That result is then saved back to the hardrive where I can do some other things with it with another program. As you can see below, I do not need all of the results of all the rows, only those bigger than a specific amount. If I run my function as it is right now in python I get about 62seconds per 1.000.000 rows. This is a very long time considering all the data and how fast it should be done with.

我必须提到的是，在data = joblib.load(file)的帮助下，我逐个文件地将实际数据文件上传到RAM，因此上载数据不是问题，因为每个文件仅花费约0.29秒的时间.上传后，我将运行下面的完整代码. doTheMath函数是花费时间最多的函数.对于愿意帮助我重写此简单代码以在GPU上运行的人员，我愿意给予我在stackoverflow上获得的全部500个信誉点.我的兴趣特别在GPU上，我真的很想看看如何解决当前的问题.

I must mention that I upload the real data file by file to the RAM with the help of data = joblib.load(file) so uploading the data is not the problem as it takes only about 0.29 seconds per file. Once uploaded I run the entire code below. What takes the longest time is the doTheMath function. I am willing to give all of my 500 reputation points I have on stackoverflow as a reward for somebody willing to help me rewrite this simple code to run on the GPU. My interest is specifically in the GPU, I really want to see how it is done on this problem at hand.

编辑/更新1: 以下是一些真实数据样本的链接: data_csv.zip 大约102000行的真实数据1和2000行用于实际的data2a和data2b.在实际样本数据上使用minimumLimit = 400

EDIT/UPDATE 1: Here is a link to a small sample of the real data: data_csv.zip About 102000 rows of real data1 and 2000 rows for real data2a and data2b. Use minimumLimit = 400 on the real sample data

编辑/更新2: 对于那些关注此帖子的人，这里是下面答案的简短摘要.到目前为止，我们对原始解决方案有4个答案. @Divakar提供的一个只是对原始代码的调整.在这两个调整中，只有第一个实际上适用于此问题，第二个调整是很好的调整，但不适用于此处.在其他三个答案中，两个是基于CPU的解决方案，另一个是tensorflow-GPU尝试. Paul Panzer的Tensorflow-GPU似乎很有希望，但是当我在GPU上实际运行它时，它的速度比原始速度慢，因此代码仍需改进.

EDIT/UPDATE 2: For those following this post here is a short summary of the answers below. Up until now we have 4 answers to the original solution. The one offered by @Divakar are just tweaks to the original code. Of the two tweaks only the first one is actually applicable to this problem, the second one is a good tweak but does not apply here. Out of the other three answers, two of them are CPU based solutions and one tensorflow-GPU try. The Tensorflow-GPU by Paul Panzer seems to be promising but when i actually run it on the GPU it is slower than the original, so the code still needs improvement.

另外两个基于CPU的解决方案由@PaulPanzer(一个纯numpy解决方案)和@MSeifert(一个numba解决方案)提交.与原始代码相比，这两种解决方案都可以提供非常好的结果，并且两种处理数据都非常快.保罗·潘泽(Paul Panzer)提交的两个中，速度更快.它在大约3秒钟内处理大约1.000.000行.唯一的问题是使用较小的batchSizes，可以通过切换到MSeifert提供的numba解决方案或在下面讨论的所有调整之后甚至使用原始代码来克服这一问题.

The other two CPU based solutions are submitted by @PaulPanzer (a pure numpy solution) and @MSeifert (a numba solution). Both solutions give very good results and both process data extremely fast compared to the original code. Of the two the one submitted by Paul Panzer is faster. It processes about 1.000.000 rows in about 3 seconds. The only problem is with smaller batchSizes, this can be overcome by either switching to the numba solution offered by MSeifert, or even the original code after all the tweaks that have been discussed below.

对于@PaulPanzer和@MSeifert在回答问题上所做的工作，我感到非常高兴和感谢.尽管如此，由于这是一个基于GPU的解决方案的问题，我在等待是否有人愿意尝试一下GPU版本，并查看与当前CPU相比，GPU上数据处理的速度要快多少.解决方案.如果没有其他答案胜过@PaulPanzer的纯粹的numpy解决方案，那么我将接受他的答案作为正确的答案，并获得赏金:)

I am very happy and thankful to @PaulPanzer and @MSeifert for the work they did on their answers. Still, since this is a question about a GPU based solution, i am waiting to see if anybody is willing to give it a try on a GPU version and see how much faster the data can be processed on the GPU when compared to the current CPU solutions. If there will be no other answers outperforming @PaulPanzer's pure numpy solution then i'll accept his answer as the right one and gets the bounty :)

编辑/更新3: @Divakar发布了针对GPU的解决方案的新答案.经过对真实数据的测试，其速度甚至无法与CPU对应的解决方案相提并论. GPU在大约1.5秒内处理大约5.000.000.这真是不可思议:)我对GPU解决方案感到非常兴奋，感谢@Divakar的发布.我还要感谢@PaulPanzer和@MSeifert的CPU解决方案:)由于GPU的原因，现在我的研究继续以惊人的速度进行:)

EDIT/UPDATE 3: @Divakar has posted a new answer with a solution for the GPU. After my testings on real data, the speed is not even comparable to the CPU counterpart solutions. The GPU processes about 5.000.000 in about 1,5 seconds. This is incredible :) I am very excited about the GPU solution and i thank @Divakar for posting it. As well as i thank @PaulPanzer and @MSeifert for their CPU solutions :) Now my research continues with an incredible speed due to the GPU :)

import pandas as pd
import numpy as np
import time

def doTheMath(tmpData1, data2a, data2b):
    A = tmpData1[:, 0]
    B = tmpData1[:,1]
    C = tmpData1[:,2]
    D = tmpData1[:,3]
    Bmax = B.max()
    Cmin  = C.min()
    dif = (Bmax - Cmin)
    abcd = ((((A - Cmin) / dif) + ((B - Cmin) / dif) + ((C - Cmin) / dif) + ((D - Cmin) / dif)) / 4)
    return np.where(((abcd <= data2a) & (abcd >= data2b)), 1, 0).sum()

#Declare variables
batchSize = 2000
sampleSize = 5000000
resultArray = []
minimumLimit = 490 #use 400 on the real sample data 

#Create Random Sample Data
data1 = np.matrix(np.random.uniform(1, 100, (sampleSize + batchSize, 4)))
data2a = np.matrix(np.random.uniform(0, 1, (batchSize, 1))) #upper limit
data2b = np.matrix(np.random.uniform(0, 1, (batchSize, 1))) #lower limit
#approx. half of data2a will be smaller than data2b, but that is only in the sample data because it is randomly generated, NOT the real data. The real data2a is always higher than data2b.


#Loop through the data
t0 = time.time()
for rowNr in  range(data1.shape[0]):
    tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
    if(tmp_df.shape[0] == batchSize):
        result = doTheMath(tmp_df, data2a, data2b)
        if (result >= minimumLimit):
            resultArray.append([rowNr , result])
print('Runtime:', time.time() - t0)

#Save data results
resultArray = np.array(resultArray)
print(resultArray[:,1].sum())
resultArray = pd.DataFrame({'index':resultArray[:,0], 'result':resultArray[:,1]})
resultArray.to_csv("Result Array.csv", sep=';')

我正在处理的PC规格:

The PC specs I am working on:

GTX970(4gb) video card; 
i7-4790K CPU 4.00Ghz; 
16GB RAM;
a SSD drive 
running Windows 7;

作为附带的问题，SLI中的第二张视频卡是否可以帮助解决此问题?

As a side question, would a second video card in SLI help on this problem?

推荐答案

简介和解决方案代码

好吧，你要了！因此，本文中列出的是使用 PyCUDA 的实现轻量级包装器扩展了Python环境中CUDA的大多数功能.我们将使用它的SourceModule功能，使我们能够编写和编译驻留在Python环境中的CUDA内核.

Well, you asked for it! So, listed in this post is an implementation with PyCUDA that uses lightweight wrappers extending most of CUDA's capabilities within Python environment. We will its SourceModule functionality that lets us write and compile CUDA kernels staying in Python environment.

要解决当前的问题，在所涉及的计算中，我们有滑动的最大值和最小值，几乎没有差异，除法和比较.对于涉及块最大值查找(针对每个滑动窗口)的最大和最小部分，我们将使用简化技术，如在以下详细讨论的部分 page-18 . PyCUDA还支持诸如max和min之类的用于计算缩减量的内置函数，但是我们失去了控制力，特别是我们打算使用专用内存(例如共享和恒定内存)来将GPU发挥到最佳水平.

Getting to the problem at hand, among the computations involved, we have sliding maximum and minimum, few differences and divisions and comparisons. For the maximum and minimum parts, that involves block max finding (for each sliding window), we will use reduction-technique as discussed in some detail here. This would be done at block level. For the upper level iterations across sliding windows, we would use the grid level indexing into CUDA resources. For more info on this block and grid format, please refer to page-18. PyCUDA also supports builtins for computing reductions like max and min, but we lose control, specifically we intend to use specialized memory like shared and constant memory for leveraging GPU at its near to optimum level.

列出PyCUDA-NumPy解决方案代码-

Listing out the PyCUDA-NumPy solution code -

1] PyCUDA部分-

1] PyCUDA part -

import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule

mod = SourceModule("""
#define TBP 1024 // THREADS_PER_BLOCK

__device__ void get_Bmax_Cmin(float* out, float *d1, float *d2, int L, int offset)
{
    int tid = threadIdx.x;
    int inv = TBP;
    __shared__ float dS[TBP][2];

    dS[tid][0] = d1[tid+offset];  
    dS[tid][1] = d2[tid+offset];         
    __syncthreads();

    if(tid<L-TBP)  
    {
        dS[tid][0] = fmaxf(dS[tid][0] , d1[tid+inv+offset]);
        dS[tid][1] = fminf(dS[tid][1] , d2[tid+inv+offset]);
    }
    __syncthreads();
    inv = inv/2;

    while(inv!=0)   
    {
        if(tid<inv)
        {
            dS[tid][0] = fmaxf(dS[tid][0] , dS[tid+inv][0]);
            dS[tid][1] = fminf(dS[tid][1] , dS[tid+inv][1]);
        }
        __syncthreads();
        inv = inv/2;
    }
    __syncthreads();

    if(tid==0)
    {
        out[0] = dS[0][0];
        out[1] = dS[0][1];
    }   
    __syncthreads();
}

__global__ void main1(float* out, float *d0, float *d1, float *d2, float *d3, float *lowL, float *highL, int *BLOCKLEN)
{
    int L = BLOCKLEN[0];
    int tid = threadIdx.x;
    int iterID = blockIdx.x;
    float Bmax_Cmin[2];
    int inv;
    float Cmin, dif;   
    __shared__ float dS[TBP*2];   

    get_Bmax_Cmin(Bmax_Cmin, d1, d2, L, iterID);  
    Cmin = Bmax_Cmin[1];
    dif = (Bmax_Cmin[0] - Cmin);

    inv = TBP;

    dS[tid] = (d0[tid+iterID] + d1[tid+iterID] + d2[tid+iterID] + d3[tid+iterID] - 4.0*Cmin) / (4.0*dif);
    __syncthreads();

    if(tid<L-TBP)  
        dS[tid+inv] = (d0[tid+inv+iterID] + d1[tid+inv+iterID] + d2[tid+inv+iterID] + d3[tid+inv+iterID] - 4.0*Cmin) / (4.0*dif);                   

     dS[tid] = ((dS[tid] >= lowL[tid]) & (dS[tid] <= highL[tid])) ? 1 : 0;
     __syncthreads();

     if(tid<L-TBP)
         dS[tid] += ((dS[tid+inv] >= lowL[tid+inv]) & (dS[tid+inv] <= highL[tid+inv])) ? 1 : 0;
     __syncthreads();

    inv = inv/2;
    while(inv!=0)   
    {
        if(tid<inv)
            dS[tid] += dS[tid+inv];
        __syncthreads();
        inv = inv/2;
    }

    if(tid==0)
        out[iterID] = dS[0];
    __syncthreads();

}
""")

请注意，THREADS_PER_BLOCK, TBP将基于batchSize进行设置.经验法则是为TBP分配2值的幂，该值小于batchSize.因此，对于batchSize = 2000，我们需要TBP作为1024.

Please note that THREADS_PER_BLOCK, TBP is to be set based on the batchSize. The rule of thumb here is to assign power of 2 value to TBP that is just lesser than batchSize. Thus, for batchSize = 2000, we needed TBP as 1024.

2] NumPy部分-

2] NumPy part -

def gpu_app_v1(A, B, C, D, batchSize, minimumLimit):
    func1 = mod.get_function("main1")
    outlen = len(A)-batchSize+1

    # Set block and grid sizes
    BSZ = (1024,1,1)
    GSZ = (outlen,1)

    dest = np.zeros(outlen).astype(np.float32)
    N = np.int32(batchSize)
    func1(drv.Out(dest), drv.In(A), drv.In(B), drv.In(C), drv.In(D), \
                     drv.In(data2b), drv.In(data2a),\
                     drv.In(N), block=BSZ, grid=GSZ)
    idx = np.flatnonzero(dest >= minimumLimit)
    return idx, dest[idx]

基准化

我已经在GTX 960M上进行了测试.请注意，PyCUDA期望数组是连续的顺序.因此，我们需要对列进行切片并进行复制.我期望/假设可以从文件中读取数据，以便数据沿行而不是按列传播.因此，暂时将其排除在基准测试功能之外.

I have tested on GTX 960M. Please note that PyCUDA expects arrays to be of contiguous order. So, we need to slice the columns and make copies. I am expecting/assuming that the data could be read from the files such that the data is spread along rows instead of being as columns. Thus, keeping those out of the benchmarking function for now.

原始方法-

def org_app(data1, batchSize, minimumLimit):
    resultArray = []
    for rowNr in  range(data1.shape[0]-batchSize+1):
        tmp_df = data1[rowNr:rowNr + batchSize] #rolling window
        result = doTheMath(tmp_df, data2a, data2b)
        if (result >= minimumLimit):
            resultArray.append([rowNr , result]) 
    return resultArray

时间和验证-

In [2]: #Declare variables
   ...: batchSize = 2000
   ...: sampleSize = 50000
   ...: resultArray = []
   ...: minimumLimit = 490 #use 400 on the real sample data
   ...: 
   ...: #Create Random Sample Data
   ...: data1 = np.random.uniform(1, 100000, (sampleSize + batchSize, 4)).astype(np.float32)
   ...: data2b = np.random.uniform(0, 1, (batchSize)).astype(np.float32)
   ...: data2a = data2b + np.random.uniform(0, 1, (batchSize)).astype(np.float32)
   ...: 
   ...: # Make column copies
   ...: A = data1[:,0].copy()
   ...: B = data1[:,1].copy()
   ...: C = data1[:,2].copy()
   ...: D = data1[:,3].copy()
   ...: 
   ...: gpu_out1,gpu_out2 = gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
   ...: cpu_out1,cpu_out2 = np.array(org_app(data1, batchSize, minimumLimit)).T
   ...: print(np.allclose(gpu_out1, cpu_out1))
   ...: print(np.allclose(gpu_out2, cpu_out2))
   ...: 
True
False

因此，CPU和GPU计数之间存在一些差异.让我们调查一下-

So, there's some differences between CPU and GPU countings. Let's investigate them -

In [7]: idx = np.flatnonzero(~np.isclose(gpu_out2, cpu_out2))

In [8]: idx
Out[8]: array([12776, 15208, 17620, 18326])

In [9]: gpu_out2[idx] - cpu_out2[idx]
Out[9]: array([-1., -1.,  1.,  1.])

有四个不匹配计数的实例.这些在1处最大关闭.经过研究，我发现了一些有关此的信息.基本上，由于我们将数学内在函数用于最大和最小计算，并且我认为那些内在函数导致浮动pt表示形式中的最后一个二进制位与CPU对应符不同.这被称为ULP错误，并已详细讨论 here 和

There are four instances of non-matching counts. These are off at max by 1. Upon research, I came across some information on this. Basically, since we are using math intrinsics for max and min computations and those I think are causing the last binary bit in the floating pt representation to be diferent than the CPU counterpart. This is termed as ULP error and has been discused in detail here and here.

最后，撇开问题，让我们来谈谈最重要的一点，即性能-

Finally, puting the issue aside, let's get to the most important bit, the performance -

In [10]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 2.18 s per loop

In [11]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
10 loops, best of 3: 82.5 ms per loop

In [12]: 2180.0/82.5
Out[12]: 26.424242424242426

让我们尝试更大的数据集.使用sampleSize = 500000，我们得到-

Let's try with bigger datasets. With sampleSize = 500000, we get -

In [14]: %timeit org_app(data1, batchSize, minimumLimit)
1 loops, best of 3: 23.2 s per loop

In [15]: %timeit gpu_app_v1(A, B, C, D, batchSize, minimumLimit)
1 loops, best of 3: 821 ms per loop

In [16]: 23200.0/821
Out[16]: 28.25822168087698

因此，提速保持恒定在 27 左右.

So, the speedup stays constant at around 27.

限制:

1)我们正在使用float32编号，因为GPU与这些编号配合使用效果最佳.特别是在非服务器GPU上，双精度在性能方面并不受欢迎，并且由于您正在使用这样的GPU，因此我在float32上进行了测试.

1) We are using float32 numbers, as GPUs work best with those. Double precision specially on non-server GPUs aren't popular when it comes to performance and since you are working with such a GPU, I tested with float32.

进一步的改进:

1)我们可以使用更快的constant memory来输入data2a和data2b，而不是使用global memory.

1) We could use faster constant memory to feed in data2a and data2b, rather than use global memory.

这篇关于Python:重写循环的numpy数学函数以在GPU上运行的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

Python:重写循环的numpy数学函数以在GPU上运行 [英] Python: rewrite a looping numpy math function to run on GPU

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

Python:重写循环的numpy数学函数以在GPU上运行 [英] Python: rewrite a looping numpy math function to run on GPU

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭