Element-wise function on pycuda::complex array


Problem description


I want to run a function on a large 2D complex array (eventually 2**12 x 2**12 data points). However, PyCUDA does not work as expected: ElementwiseKernel does not handle 2D arrays, so I used SourceModule with explicit block sizes.


The problem is that the C code on the GPU does not give the same result as the numpy calculation on the CPU; it produces very large, strange numbers.


I'm using the following code. What's going wrong?

#!/usr/bin/env python
#https://github.com/lebedov/scikits.cuda/blob/master/demos/indexing_2d_demo.py
"""
Demonstrates how to access 2D arrays within a PyCUDA kernel in a
numpy-consistent manner.
"""

from string import Template
import pycuda
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule
import numpy as np
from matplotlib import pyplot as plt

# Set size
A = 2**3
B = 2**3
N = A*B
x_gpu = gpuarray.to_gpu(np.fromfunction(lambda x,y: (1.+x)*np.exp(1.j*y*np.pi/10), (A,B)) )
y_gpu = gpuarray.to_gpu(np.fromfunction(lambda x,y: 1.*x, (A,B)).astype(
                                x_gpu.dtype)) 
d_gpu = gpuarray.to_gpu(np.zeros_like(x_gpu.get()))#.astype(np.float32))

func_mod_template = Template("""
// Macro for converting subscripts to linear index:
#define INDEX(a, b) a*${B}+b
#include <pycuda-complex.hpp>

//__global__ void func(double *d,double *x,double *y, unsigned int N) {
__global__ void func(pycuda::complex<float> *d,pycuda::complex<float> *x,
                     pycuda::complex<float> *y)
{
    // Obtain the linear index corresponding to the current thread:     
    // unsigned int idx = blockIdx.y*blockDim.y*gridDim.x +
    //                     blockIdx.x*blockDim.x*gridDim.y + threadIdx.x + threadIdx.y;
    unsigned int block_num        = blockIdx.x + blockIdx.y * gridDim.x;              
    unsigned int thread_num       = threadIdx.y * blockDim.x + threadIdx.x;           
    unsigned int threads_in_block = blockDim.x * blockDim.y;                          
    unsigned int idx              =  (threads_in_block * block_num + thread_num);

    // Convert the linear index to subscripts:
    unsigned int a = idx/${B};
    unsigned int b = idx%${B};

    // Use the subscripts to access the array:
    // d[INDEX(a,b)] = x[INDEX(a,b)]+y[INDEX(a,b)];
    pycuda::complex<float> j(0,arg(x[idx]));
    pycuda::complex<float> i(abs(x[idx]),0);
    d[idx] = i * exp(j);
}
""")

max_threads_per_block = pycuda.autoinit.device.get_attribute(pycuda._driver.device_attribute.MAX_THREADS_PER_BLOCK)
max_block_dim = (pycuda.autoinit.device.get_attribute(pycuda._driver.device_attribute.MAX_BLOCK_DIM_X),
                 pycuda.autoinit.device.get_attribute(pycuda._driver.device_attribute.MAX_BLOCK_DIM_Y),
                 pycuda.autoinit.device.get_attribute(pycuda._driver.device_attribute.MAX_BLOCK_DIM_Z))
max_grid_dim = (pycuda.autoinit.device.get_attribute(pycuda._driver.device_attribute.MAX_GRID_DIM_X),
                pycuda.autoinit.device.get_attribute(pycuda._driver.device_attribute.MAX_GRID_DIM_Y),
                pycuda.autoinit.device.get_attribute(pycuda._driver.device_attribute.MAX_GRID_DIM_Z))
max_blocks_per_grid = max(max_grid_dim)
block_dim = max_block_dim
block_dim = (max_block_dim[0],1,1)
grid_dim = (int(np.ceil(1.*x_gpu.shape[0]/block_dim[0])),
            int(np.ceil(1.*x_gpu.shape[1]/block_dim[1])))
print block_dim,grid_dim, N

func_mod = \
         SourceModule(func_mod_template.substitute(max_threads_per_block=max_threads_per_block,
                                                   max_blocks_per_grid=max_blocks_per_grid,
                                                   A=A, B=B))
func = func_mod.get_function('func')
func(d_gpu,x_gpu,y_gpu,
     block=block_dim,
    grid=grid_dim)

print d_gpu.get()/x_gpu.get()
#print 'Success status: ', np.allclose(x_np, x_gpu.get())
plt.imshow((d_gpu.get()/x_gpu.get()).real)
plt.colorbar()
plt.show()


Recommended answer


As an actual answer: changing the x_gpu line to

x_gpu = gpuarray.to_gpu(np.fromfunction(
    lambda x,y: (1.+x)*np.exp(1.j*y*np.pi/10), (A,B)).astype(np.complex64) )
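Why the cast matters (a minimal check; np.fromfunction with a 1.j term returns numpy's default complex128, while the kernel declares pycuda::complex<float>, i.e. complex64, so without the cast the kernel presumably reads the buffer with the wrong element size):

import numpy as np

x = np.fromfunction(lambda a, b: (1. + a) * np.exp(1.j * b * np.pi / 10), (8, 8))
print(x.dtype)                        # complex128 -- numpy's default complex dtype
print(x.astype(np.complex64).dtype)   # complex64  -- matches pycuda::complex<float>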

That cast seems to fix the problem. Also, although ElementwiseKernel does not work with 2D arrays, you are doing the 2D-to-1D index conversion yourself anyway, so nothing really stops you from writing

from pycuda.elementwise import ElementwiseKernel

func = ElementwiseKernel(
    "pycuda::complex<float> *d, pycuda::complex<float> *x, pycuda::complex<float> *y",

    Template("""
    // Convert the linear index to subscripts:
    unsigned int a = i/${B};
    unsigned int b = i%${B};

    // Use the subscripts to access the array:
    //d[INDEX(a,b)] = x[INDEX(a,b)]+y[INDEX(a,b)];
    pycuda::complex<float> angle(0,arg(x[i]));
    pycuda::complex<float> module(abs(x[i]),0);
    d[i] = module * exp(angle);
    """).substitute(A=A, B=B),

    preamble=Template("""
    #define INDEX(a, b) a*${B}+b
    """).substitute(A=A, B=B))

func(d_gpu, x_gpu, y_gpu)


This way you will not need to juggle block/grid sizes because PyCUDA will handle this for you.
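As a quick sanity check (an illustrative sketch): the kernel computes |x| * exp(i*arg(x)), which should reproduce x itself, so the GPU output can be compared directly against the same computation done in numpy:

import numpy as np

x_cpu = x_gpu.get()
# the same polar round-trip on the CPU: |x| * exp(i*arg(x)) == x, up to float32 rounding
expected = np.abs(x_cpu) * np.exp(1j * np.angle(x_cpu))

print(np.allclose(d_gpu.get(), expected, atol=1e-5))  # True once the dtypes match
print(d_gpu.get() / x_cpu)                            # close to 1 everywhere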

