Parallel Kronecker tensor product on GPUs using CUDA

Views: 105 Posted: 2020/4/30 12:02:17 Tags: matlab parallel-processing cuda gpu linear-algebra

Problem description

I am trying to parallelise [this file][1] on a GPU using a [PTX file with MATLAB's parallel.gpu.CUDAKernel][2]. My problem is with the [kron tensor product][3]. My code should compute kron(a,b) for two vectors by multiplying each element of the first vector a=<32x1> by all the elements of the other vector b=<1x32>, so the output has size k<32x32>=a.*b. I wrote it in C++ and it worked. Since I only care about the sum of all the elements of the 2D array, m=sum(sum(kron(a,b))), I thought I could simplify things by storing the result as a 1D array. This is the code I am working with:

for (i = 0; i < 32; i++)
  for (j = 0; j < 32; j++)
    k[i*32 + j] = a[i] * b[j];
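Because only the total m=sum(sum(kron(a,b))) is needed, it may help to note that the double sum factorises: the sum over i and j of a[i]*b[j] equals (sum of a) times (sum of b). The C++ sketch below checks this on the host against the double loop above. The function names kronSum and factoredSum are illustrative and not part of the original code, and the long long accumulators are an assumption made to reduce the risk of integer overflow.

```cpp
#include <cstddef>
#include <vector>

// Reference version: form every product a[i]*b[j] and add them up,
// mirroring the double loop (without storing the 2D result).
long long kronSum(const std::vector<int>& a, const std::vector<int>& b) {
    long long total = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            total += static_cast<long long>(a[i]) * b[j];
    return total;
}

// Factored version: sum each vector once, then multiply the two sums.
long long factoredSum(const std::vector<int>& a, const std::vector<int>& b) {
    long long sumA = 0, sumB = 0;
    for (int x : a) sumA += x;
    for (int y : b) sumB += y;
    return sumA * sumB;
}
```

If the identity is acceptable for the application, the 32x32 intermediate array is not needed at all: two length-32 sums and one multiplication give the same m.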

The idea is to multiply the element a[i] by each element of b. I thought I would use 32 blocks with 32 threads each, so the code should be:

__global__ void myKrom(int* c, int* a, int* b) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  while (i < 32) {
    c[i] = a[blockIdx.x] + b[blockDim.x * blockIdx.x + threadIdx.x];
  }
}

That should do the trick, since blockIdx.x plays the role of the outer loop, but it didn't. Could anybody tell me where I went wrong? May I also ask for a parallel way to do the sum?
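One possible reading of why the kernel above fails, offered as a sketch and not a definitive answer:

- The while loop never changes i, so any thread with i<32 never exits. An if bounds check is the usual pattern.
- The bound 32 covers only the first block, although c holds 32*32 elements.
- The operator is + where the loop version uses *.
- b has only 32 elements, so it would be indexed by threadIdx.x, not by the global index.

The C++ code below emulates the intended mapping of 32 blocks with 32 threads each on the CPU, so it can be checked without a GPU. The parameters blockIdxX, threadIdxX and blockDimX stand in for CUDA's built-ins blockIdx.x, threadIdx.x and blockDim.x, and the names kronBody, launchEmulated and treeSum are hypothetical. treeSum follows the pairwise pattern commonly used for parallel reductions. In a real kernel each inner-loop iteration would be handled by a separate thread, with a synchronisation between strides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Body of the corrected kernel for one emulated thread.
// In CUDA this would be a __global__ function reading the built-ins directly.
void kronBody(int* c, const int* a, const int* b, int n,
              int blockIdxX, int threadIdxX, int blockDimX) {
    int i = blockDimX * blockIdxX + threadIdxX;  // global output index
    if (i < n * n) {                             // bounds check instead of a while loop
        // blockIdxX selects the element of a, threadIdxX the element of b
        c[i] = a[blockIdxX] * b[threadIdxX];
    }
}

// Emulates a launch of n blocks with n threads each by looping over both indices.
std::vector<int> launchEmulated(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    int n = static_cast<int>(a.size());
    std::vector<int> c(static_cast<std::size_t>(n) * n);
    for (int block = 0; block < n; ++block)
        for (int thread = 0; thread < n; ++thread)
            kronBody(c.data(), a.data(), b.data(), n, block, thread, n);
    return c;
}

// Pairwise (tree) sum: each pass adds the upper half onto the lower half.
// Assumes the length is a power of two, as with 32*32 = 1024 elements.
long long treeSum(const std::vector<int>& values) {
    assert(!values.empty() && (values.size() & (values.size() - 1)) == 0);
    std::vector<long long> v(values.begin(), values.end());
    for (std::size_t stride = v.size() / 2; stride > 0; stride /= 2)
        for (std::size_t t = 0; t < stride; ++t)  // one GPU thread per t
            v[t] += v[t + stride];
    return v[0];
}
```

Calling treeSum(launchEmulated(a, b)) on two length-32 vectors gives the same total as the sequential double loop. On the GPU, the same mapping would correspond to launching 32 blocks of 32 threads.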
