CUDA - Extract Layer from 3D Array
Problem description
I have a 3D matrix where the x-y plane(s) represent an image and the z-plane represents image layers.
The issue is when I try to extract the first (or other layers) using idz, I do not get the expected results. It looks like the array, once in CUDA, has different indexes for x, y or z than what I expect (as in pycuda). I see this by the result array below.
The following is a step-by-step walkthrough of this mini example (I used generic int values to represent my images, to save uploading the images and the entire code)!
Here I import libraries and define image size and layers...
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
from pycuda.gpuarray import to_gpu
row = 10
column = 10
depth = 5
Then I define my input 3D array and my output 2D array...
#--==== Input 3D Array ====---
arrayA = numpy.full((row, column, depth), 0)
#populate each layer with fixed values
for i in range(depth):
    arrayA[:,:,i] = i + 1
arrayA = arrayA.astype(numpy.uint16)
arrayA_gpu = cuda.mem_alloc(arrayA.nbytes)
cuda.memcpy_htod(arrayA_gpu, arrayA)
arrayA_Answer = numpy.empty_like(arrayA)
#--==== Output 2D array container ====---
arrayB = numpy.zeros([row, column], dtype = numpy.uint16)
arrayB_gpu = cuda.mem_alloc(arrayB.nbytes)
cuda.memcpy_htod(arrayB_gpu, arrayB)
arrayB_Answer = numpy.empty_like(arrayB)
Next, I define the CUDA kernel and function in pycuda...
mod = SourceModule("""
__global__ void getLayer(int *arrayA, int *arrayB)
{
    int idx = threadIdx.x + (blockIdx.x * blockDim.x); // x coordinate (numpy axis 2)
    int idy = threadIdx.y + (blockIdx.y * blockDim.y); // y coordinate (numpy axis 1)
    int idz = 0; // the first layer; this can be set in the range 0-4
    int x_width = (blockDim.x * gridDim.x);
    int y_width = (blockDim.y * gridDim.y);
    arrayB[idx + (x_width * idy)] = arrayA[idx + (x_width * idy) + (x_width * y_width) * idz];
}
""")
func = mod.get_function("getLayer")
func(arrayA_gpu, arrayB_gpu, block=(row, column, 1), grid=(1,1))
# copy the results back to the host before printing
cuda.memcpy_dtoh(arrayA_Answer, arrayA_gpu)
cuda.memcpy_dtoh(arrayB_Answer, arrayB_gpu)
Using standard pycuda commands, I extract the results (not what I expected)
arrayA[:,:,0] = 10x10 matrix populated with 1's (good)
print(arrayA_Answer[:,:,0])
[[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1 1]]
arrayB[:,:] = 10x10 matrix populated with the following (bad), expected to be equal to arrayA[:,:,0]...
print(arrayB_Answer)
[[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]
[1 2 3 4 5 1 2 3 4 5]]
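The interleaved pattern above can be reproduced on the host: numpy stores this array in C order, so the first 100 elements in memory cycle through the 5 layer values, which is exactly what the kernel read. A minimal numpy sketch:

```python
import numpy as np

# Rebuild the question's input array.
row, column, depth = 10, 10, 5
a = np.zeros((row, column, depth), dtype=np.int32)
for i in range(depth):
    a[:, :, i] = i + 1

# The kernel copied the first row*column elements of linear memory into
# arrayB; in C order those elements cycle through the layers.
flat = a.ravel()
bad = flat[:row * column].reshape(row, column)
print(bad[0].tolist())  # -> [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
```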
Answer
As discussed here, the numpy 3D storage order pattern is that the "z" (i.e. 3rd) index is the rapidly varying index as you progress linearly through memory. Your code assumes that the first index ("x") is the rapidly varying one.
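This can be checked directly in numpy with a tiny array; `ravel()` returns the elements in memory order:

```python
import numpy as np

# In C (row-major) order, the last (3rd) index varies fastest in memory.
a = np.zeros((2, 2, 3), dtype=np.int32)
for i in range(3):
    a[:, :, i] = i + 1  # layer i holds the value i+1

# Walking memory linearly, the layer values interleave immediately.
print(a.ravel()[:6].tolist())  # -> [1, 2, 3, 1, 2, 3]
```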
由于已经为高效(合并")加载/存储行为组织了内核,因此可以通过以numpy重新排列图像/图层/切片的存储顺序来解决此问题.这是一个可行的示例:
Since your kernel is already organized for efficient ("coalesced") load/store behavior, you could address this by reordering the storage of your images/layers/slices in numpy. Here is a worked example:
$ cat t10.py
from __future__ import print_function
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
from pycuda.gpuarray import to_gpu
row = 5
column = 10
depth = 10
#--==== Input 3D Array ====---
arrayA = numpy.full((row, column, depth), 0)
my_slice=numpy.int32(3) # choose the layer
#populate each layer with fixed values
for i in range(row):
    arrayA[i,:,:] = i + 1
arrayA = arrayA.astype(numpy.int32)
arrayA_gpu = cuda.mem_alloc(arrayA.nbytes)
cuda.memcpy_htod(arrayA_gpu, arrayA)
arrayA_Answer = numpy.empty_like(arrayA)
#--==== Output 2D array container ====---
arrayB = numpy.zeros([column, depth], dtype = numpy.int32)
arrayB_gpu = cuda.mem_alloc(arrayB.nbytes)
cuda.memcpy_htod(arrayB_gpu, arrayB)
arrayB_Answer = numpy.empty_like(arrayB)
mod = SourceModule("""
__global__ void getLayer(int *arrayA, int *arrayB, int slice)
{
    int idx = threadIdx.x + (blockIdx.x * blockDim.x); // x coordinate (numpy axis 2)
    int idy = threadIdx.y + (blockIdx.y * blockDim.y); // y coordinate (numpy axis 1)
    int idz = slice; // the "layer"
    int x_width = (blockDim.x * gridDim.x);
    int y_width = (blockDim.y * gridDim.y);
    arrayB[idx + (x_width * idy)] = arrayA[idx + (x_width * idy) + (x_width * y_width) * idz];
}
""")
func = mod.get_function("getLayer")
func(arrayA_gpu, arrayB_gpu, my_slice, block=(depth, column, 1), grid=(1,1))
cuda.memcpy_dtoh(arrayB_Answer,arrayB_gpu)
print(arrayA[my_slice,:,:])
print(arrayB_Answer[:,:])
$ python t10.py
[[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]]
[[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]
[4 4 4 4 4 4 4 4 4 4]]
$
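The kernel's index arithmetic for the reordered layout can be sanity-checked on the host: with shape (row, column, depth) and the layer on axis 0, element (slice, idy, idx) lives at flat offset idx + depth*idy + (depth*column)*slice, matching `arrayA[idx + (x_width * idy) + (x_width * y_width) * idz]`:

```python
import numpy as np

# Host-side check of the kernel's flat-index arithmetic for C order.
row, column, depth = 5, 10, 10
a = np.arange(row * column * depth, dtype=np.int32).reshape(row, column, depth)

s, idy, idx = 3, 4, 7  # arbitrary layer/row/column choice
flat = idx + depth * idy + (depth * column) * s  # same formula as the kernel
print(int(a.ravel()[flat]) == int(a[s, idy, idx]))  # -> True
```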
Note that I have also changed your use of uint16 to int32, to match the kernel type int.
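As an aside, if reordering the 3D array is not desirable, one alternative (a hypothetical sketch, not part of the original answer) is to slice the layer out in numpy on the host and make it contiguous before any GPU upload, skipping the extraction kernel entirely:

```python
import numpy as np

# Hypothetical alternative: slice the layer on the host. A slice along the
# last axis is NOT C-contiguous, so copy it before cuda.memcpy_htod.
row, column, depth = 10, 10, 5
arrayA = np.zeros((row, column, depth), dtype=np.int32)
for i in range(depth):
    arrayA[:, :, i] = i + 1

idz = 2
layer = np.ascontiguousarray(arrayA[:, :, idz])  # dense (10, 10) copy
print(layer.flags['C_CONTIGUOUS'], int(layer[0, 0]))  # -> True 3
```

This trades an extra host-side copy for not having to change the storage order of the whole 3D array.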