Sub-Matrix computations


Problem Description



I want to calculate the pairwise distance between two sub-matrices of a matrix. For example, I have a matrix A (MxN) and two blocks of that matrix, B1 (mxn) and B2 (kxt). More specifically, I want to calculate the distance of the B1(1,1) element from all the other elements of B2, and to do this for all the B1 elements. To be clear, B1 and B2 may not be compact parts of the matrix; basically, the information I know is the coordinates of the elements of B1 and B2 on the matrix A. Here is an example.

for(int i = 0; i < nRowsCoordsB1; i++){ // nRows of B1
  for(int j = 0; j < nRowsCoordsB2; j++){ // nRows of B2

    // CoordsofB1 is a nRowsCoordsB1 x 2 matrix that contains the
    // element coordinates of the B1 sub-matrix
    a_x = CoordsofB1[ i ];                 // x coord of row i
    a_y = CoordsofB1[ i + nRowsCoordsB1 ]; // y coord of row i

    b_x = CoordsofB2[ j ];
    b_y = CoordsofB2[ j + nRowsCoordsB2 ];

    int element1 = A[ a_x + a_y*nRowsofA ];
    int element2 = A[ b_x + b_y*nRowsofA ];
    sum += abs( element1 - element2 );

  }
}
*Output = sum/(float)(numberOfElementsofB1*numberOfElementsofB2);

Now I want to speed up the computations with CUDA :) Because I am new to CUDA, I found it a little complicated. So far I think I have understood the logic of allocating thread blocks at the matrix level, but here the fact that I have two different parts of the matrix with different sizes, CoordsofB1 and CoordsofB2, confuses me a little as to how I can access them, take the coordinates, and use them in the whole matrix. I thought that we should work in A using constraints, but I did not come up with a clear idea.

Also, the fact that at the end of the for loops the sum is divided by a quantity confuses me as to how we would combine this in the CUDA-translated code.

Any suggestions, snippets, examples, or references would be great.

PS: the reason I use column-major ordering is that the code is evaluated in MATLAB.

UPDATE: Can we allocate a thread block of size equal to the size of the biggest sub-matrix, B1 or B2, and work with it using the correct conditions? I commented out the last line because I was not sure what to do with it. Any comments?

int r = blockDim.x * blockIdx.x + threadIdx.x; // rows
if( r < nRowsCoordsB1 ){       

  a_x = CoordsofB1[ r ]; 
  a_y = CoordsofB1[ r + nRowsCoordsB1 ]; 
  if( r < nRowsCoordsB2 ){

    b_x = CoordsofB2[ r ];
    b_y = CoordsofB2[ r + nRowsCoordsB2 ];
    int element1 = A[ a_x + a_y*nRowsofA ];
    int element2 = A[ b_x + b_y*nRowsofA ] ;
    sum +=abs( element1 - element2 ) ;

  }
}
//*Output = sum/(float)(numberOfElementsofB1*numberOfElementsofB2); 

Here is a sketch.

I have the coordinates of each element inside B1 and B2, and I want to calculate the sum of the differences between the values:

[ (B1(1,1) - B2(1,1)) + (B1(1,1) - B2(1,2)) + ... + (B1(1,1) - B2(:,:)) ] +

[ (B1(1,2) - B2(1,1)) + (B1(1,2) - B2(1,2)) + ... + (B1(1,2) - B2(:,:)) ] +

[ (B1(:,:) - B2(1,1)) + (B1(:,:) - B2(1,2)) + ... + (B1(:,:) - B2(:,:)) ].

Solution

If I understand it correctly, what you are trying to do can be written as the following MATLAB code.

rep_B1 = repmat(B1(:),  1, length(B2(:)) );
rep_B2 = repmat(B2(:)', length(B1(:)), 1 );
absdiff_B1B2 = abs(rep_B1 - rep_B2);
Result = mean( absdiff_B1B2(:) );

You will notice that before the reduction there is a matrix absdiff_B1B2 of size length(B1(:)) x length(B2(:)), i.e. m*n x k*t (this matrix is never stored in global memory if you implement the above code in one CUDA kernel). You could partition this matrix into 16x16 sub-matrices and use one 256-thread block per sub-matrix to decompose the workload for the GPU.

On the other hand, you could use thrust to make your life easier.

Update

Since B1 and B2 are sub-matrices of A, you could first use cudaMemcpy2D() to copy them to linear space, then use a kernel to construct and then reduce the matrix absdiff_B1B2.

For the final normalization operation (last line of your code), you could do it on CPU.

Here's the code using Thrust to show how to construct and reduce the matrix absdiff_B1B2 in a single kernel. However, you will find that the construction procedure uses no shared memory and is not optimized. Further optimization using shared memory would improve the performance.

#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>

template<typename T>
struct abs_diff
{
    inline __host__ __device__ T operator()(const T& x, const T& y)
    {
        return abs(x - y);
    }
};

int main()
{
    using namespace thrust::placeholders;

    const int m = 98;
    const int n = 87;
    int k = 76;
    int t = 65;
    double result;

    thrust::device_vector<double> B1(m * n, 1.0);
    thrust::device_vector<double> B2(k * t, 2.0);

    result = thrust::inner_product(
            thrust::make_permutation_iterator(
                    B1.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 % (m * n))),
            thrust::make_permutation_iterator(
                    B1.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 % (m * n))) + (m * n * k * t),
            thrust::make_permutation_iterator(
                    B2.begin(),
                    thrust::make_transform_iterator(
                            thrust::make_counting_iterator(0),
                            _1 / (m * n))),
            0.0,
            thrust::plus<double>(),
            abs_diff<double>());
    result /= m * n * k * t;

    std::cout << result << std::endl;

    return 0;
}
