In-place transpose of a 3D array in PyCuda


Problem description

I have a 3D array and would like to transpose its first two dimensions (x and y), but not the third (z). On a 3D array A, I want the same result as numpy's A.transpose((1, 0, 2)).
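For reference, the numpy operation behaves like this (a minimal standalone illustration):

import numpy as np

# Swap the first two axes and leave the third alone: element (x, y, z)
# moves to (y, x, z), i.e. each z-slice is transposed independently.
B = np.arange(4 * 4 * 3).reshape((4, 4, 3))
B_T = B.transpose((1, 0, 2))
assert np.array_equal(B_T[:, :, 0], B[:, :, 0].T)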

Specifically, I want to get the "transposed" global threadIdx. The code below is supposed to write the transposed index at the untransposed location in the 3D array A. It doesn't.

Any suggestions?

import numpy as np
from pycuda import compiler, gpuarray
import pycuda.driver as cuda
import pycuda.autoinit

kernel_code = """
__global__ void test_indexTranspose(uint*A){
    const size_t size_x = 4;
    const size_t size_y = 4;
    const size_t size_z = 3;

    // Thread position in each dimension
    const size_t tx = blockDim.x * blockIdx.x + threadIdx.x;
    const size_t ty = blockDim.y * blockIdx.y + threadIdx.y;
    const size_t tz = blockDim.z * blockIdx.z + threadIdx.z;

    if(tx < size_x && ty < size_y && tz < size_z){
        // Flat index
        const size_t ti = tz * size_x * size_y + ty * size_x + tx;
        // Transposed flat index
        const size_t tiT = tz * size_x * size_y + tx * size_x + ty;
        A[ti] = tiT;
    }
}
"""

A = np.zeros((4,4,3),dtype=np.uint32)
mod = compiler.SourceModule(kernel_code)
test_indexTranspose = mod.get_function('test_indexTranspose')
A_gpu = gpuarray.to_gpu(A)
test_indexTranspose(A_gpu, block=(2, 2, 1), grid=(2, 2, 3))

This is what is returned (not what I expected):

A_gpu.get()[:,:,0]
array([[ 0, 12,  9,  6],
       [ 3, 15, 24, 21],
       [18, 30, 27, 36],
       [33, 45, 42, 39]], dtype=uint32)

A_gpu.get()[:,:,1]
array([[ 4,  1, 13, 10],
       [ 7, 16, 28, 25],
       [22, 19, 31, 40],
       [37, 34, 46, 43]], dtype=uint32)

A_gpu.get()[:,:,2]
array([[ 8,  5,  2, 14],
       [11, 20, 17, 29],
       [26, 23, 32, 44],
       [41, 38, 35, 47]], dtype=uint32)

This is what I expected (but was not returned):

A_gpu.get()[:,:,0]
array([[0, 4, 8,  12],
       [1, 5, 9,  13],
       [2, 6, 10, 14],
       [3, 7, 11, 15]], dtype=uint32)

A_gpu.get()[:,:,1]
array([[16, 20, 24, 28],
       [17, 21, 25, 29],
       [18, 22, 26, 30],
       [19, 23, 27, 31]], dtype=uint32)

A_gpu.get()[:,:,2]
...

Thanks

Answer

Creating the numpy array with strides that are consistent with the CUDA kernel code solves the problem. The default layout of a numpy array is not row, column, depth as my kernel assumes; however, the strides can be set when the array is created. The above kernel works fine if the array is created like this:

nRows = 4
nCols = 4
nSlices = 3
nBytes = np.dtype(np.uint32).itemsize
A = np.ndarray(shape=(nRows, nCols, nSlices), 
               dtype=np.uint32, 
               strides=(nCols*nBytes, 1*nBytes, nCols*nRows*nBytes))

The strides give, for each dimension, the number of bytes to jump in memory between consecutive indices along that dimension. E.g., from the first element of row 1 to the first element of row 2 is nCols * nBytes, i.e. 16 bytes.
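A quick numpy-only check of this stride arithmetic (a sketch; as_strided is used purely for illustration and assumes the 4x4x3 sizes from above):

import numpy as np

nRows, nCols, nSlices = 4, 4, 3
nBytes = np.dtype(np.uint32).itemsize

# Lay the flat indices 0..47 over the strided view: element
# (row, col, slice) should land at flat offset
# slice*nCols*nRows + row*nCols + col, which is exactly the `ti`
# the kernel computes for (tx=col, ty=row, tz=slice).
flat = np.arange(nRows * nCols * nSlices, dtype=np.uint32)
A_view = np.lib.stride_tricks.as_strided(
    flat, shape=(nRows, nCols, nSlices),
    strides=(nCols * nBytes, 1 * nBytes, nCols * nRows * nBytes))
row, col, sl = 2, 1, 2
assert A_view[row, col, sl] == sl * nCols * nRows + row * nCols + col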
