CUDA: least-squares solving, poor in speed


Problem description

Recently I used CUDA to write an algorithm called orthogonal matching pursuit. In my ugly CUDA code the entire iteration takes 60 s, while the Eigen library takes just 3 s...

In my code the matrix A is [640, 1024] and y is [640, 1]. In each step I select some columns from A to compose a new matrix called A_temp [640, itera], itera = 1:500. I allocate an array MaxDex_Host[] on the CPU that records which columns to select.

I want to get x_temp [itera, 1] from A_temp * x_temp = y using least squares. I use the CULA API culaDeviceSgels and the cuBLAS matrix-vector multiplication API.

So culaDeviceSgels is called 500 times, and I thought this would be faster than the Eigen library's QR solver.

I checked the Nsight performance analysis and found that cuStreamDestroy takes a long time. I initialize cuBLAS before the iteration and destroy it after I get the result, so I want to know what cuStreamDestroy is and how it differs from cublasDestroy.

The main problem is the memcpy and the function gemm_kernel1x1val; I think this function comes from culaDeviceSgels.

while (itera < 500): I use cublasSgemv and cublasIsamax to get MaxDex_Host[itera] (a sketch of this selection step follows the snippet below), then:

    MaxDex_Host[itera] = pos;
    itera++;
    float* A_temp_cpu = new float[M * itera];   // matrices are all in column-major order
    for (int j = 0; j < itera; j++)             // build A_temp [M, itera]; MaxDex_Host[] holds the indices of the columns of A to pick
    {
        for (int i = 0; i < M; i++)             // M = 640, A is 640*1024, itera grows by 1 each step
        {
            A_temp_cpu[j * M + i] = A[MaxDex_Host[j] * M + i];
        }
    }
    // I must allocate one more array because culaDeviceSgels decomposes its input array in place,
    // and I still want to use A_temp after the least-squares solve.
    float* A_temp_gpu;
    float* A_temp2_gpu;
    cudaMalloc((void**)&A_temp_gpu, Size_float * M * itera);
    cudaMalloc((void**)&A_temp2_gpu, Size_float * M * itera);
    cudaMemcpy(A_temp_gpu, A_temp_cpu, Size_float * M * itera, cudaMemcpyHostToDevice);
    cudaMemcpy(A_temp2_gpu, A_temp_gpu, Size_float * M * itera, cudaMemcpyDeviceToDevice);
    culaDeviceSgels('N', M, itera, 1, A_temp_gpu, M, y_Gpu_temp, M); // the x_temp I want is returned in y_Gpu_temp, stored in y_Gpu_temp[0]..y_Gpu_temp[itera-1]
    float* x_temp;
    cudaMalloc((void**)&x_temp, Size_float * itera);
    cudaMemcpy(x_temp, y_Gpu_temp, Size_float * itera, cudaMemcpyDeviceToDevice);
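
The selection step that produces pos is not shown above. A minimal sketch of how cublasSgemv plus cublasIsamax could produce it, assuming a cuBLAS handle, a device copy of A, and a device residual d_r (these names are illustrative, not from the original code):

    // proj = A^T * r : correlation of every column of A with the current residual.
    // A is M x N (column-major) on the device; d_r has length M, d_proj has length N.
    const float one = 1.0f, zero = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_T, M, N, &one, d_A, M, d_r, 1, &zero, d_proj, 1);

    int pos;                                    // cublasIsamax returns a 1-based index
    cublasIsamax(handle, N, d_proj, 1, &pos);   // index of the largest |proj[k]|
    pos -= 1;                                   // 0-based column index stored in MaxDex_Host[itera]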

CUDA's memory management seems too complex. Is there any other convenient method to solve the least-squares problem?

Recommended answer

I think that cuStreamDestroy and gemm_kernel1x1val are called internally by the APIs you are using, so there is not much you can do about them.

To improve your code, I would suggest the following.



  1. You can get rid of A_temp_cpu by keeping a device copy of the matrix A. Then you can copy the selected columns of A into A_temp_gpu and A_temp2_gpu with a kernel assignment (see the first sketch after this list). This avoids the first two cudaMemcpys.
  2. You can preallocate A_temp_gpu and A_temp2_gpu outside the while loop by using the maximum possible value of itera instead of the current itera. This avoids the first two cudaMallocs inside the loop. The same applies to x_temp.
  3. As far as I know, culaDeviceSgels solves a linear system of equations. I think you can do the same using cuBLAS APIs only. For example, you can perform an LU factorization first by cublasDgetrfBatched() and then use cublasStrsv() two times to solve the two arising linear systems (see the second sketch after this list). You may wish to see whether this leads to a faster algorithm.
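
For point 1, a minimal sketch of such a gather kernel, assuming A has been copied to the device once and the chosen column indices live in a device array (all names here are illustrative, not from the original code):

    // Gather the selected columns of A (column-major, M rows) into A_temp.
    // d_colIdx holds the itera column indices chosen so far.
    __global__ void gatherColumns(const float* __restrict__ d_A,
                                  const int*   __restrict__ d_colIdx,
                                  float*       __restrict__ d_A_temp,
                                  int M, int itera)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // row within a column
        int j = blockIdx.y;                              // which selected column
        if (i < M && j < itera)
            d_A_temp[j * M + i] = d_A[d_colIdx[j] * M + i];
    }

    // Launch: one y-block of the grid per selected column, e.g.
    //   dim3 block(256);
    //   dim3 grid((M + block.x - 1) / block.x, itera);
    //   gatherColumns<<<grid, block>>>(d_A, d_colIdx, d_A_temp, M, itera);

With A resident on the device, the host staging buffer A_temp_cpu and the host-to-device cudaMemcpy disappear; the copy into A_temp2_gpu can be a second launch of the same kernel (or a device-to-device cudaMemcpy), which is still needed because culaDeviceSgels overwrites its input.

For point 3, a hedged outline of one way to stay inside cuBLAS: form the normal equations (A_temp^T A_temp) x = A_temp^T y, LU-factorize the small itera-by-itera matrix, and finish with two triangular solves. This assumes A_temp has full column rank (so the normal matrix is symmetric positive definite and unpivoted LU is acceptable), uses the single-precision variants to match the rest of the code, and omits error checking; it is a sketch of the suggestion, not necessarily what culaDeviceSgels does internally:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Solve (A^T A) x = A^T y for the current A_temp (M x n, column-major, on the device).
    void solveNormalEquations(cublasHandle_t h, const float* d_A, const float* d_y,
                              float* d_x, int M, int n)
    {
        const float one = 1.0f, zero = 0.0f;

        // G = A^T * A (n x n); b = A^T * y (length n), written into d_x and solved in place.
        float* d_G;
        cudaMalloc(&d_G, sizeof(float) * n * n);
        cublasSgemm(h, CUBLAS_OP_T, CUBLAS_OP_N, n, n, M,
                    &one, d_A, M, d_A, M, &zero, d_G, n);
        cublasSgemv(h, CUBLAS_OP_T, M, n, &one, d_A, M, d_y, 1, &zero, d_x, 1);

        // In-place LU of G. getrfBatched expects a device array of matrix pointers;
        // a null pivot array disables pivoting, which is acceptable for an SPD matrix.
        float** d_Garray;
        int* d_info;
        cudaMalloc(&d_Garray, sizeof(float*));
        cudaMalloc(&d_info, sizeof(int));
        cudaMemcpy(d_Garray, &d_G, sizeof(float*), cudaMemcpyHostToDevice);
        cublasSgetrfBatched(h, n, d_Garray, n, NULL, d_info, 1);

        // Two triangular solves: L z = b (unit lower), then U x = z (non-unit upper).
        cublasStrsv(h, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N, CUBLAS_DIAG_UNIT,
                    n, d_G, n, d_x, 1);
        cublasStrsv(h, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                    n, d_G, n, d_x, 1);

        cudaFree(d_G);
        cudaFree(d_Garray);
        cudaFree(d_info);
    }

In the spirit of point 2, d_G, d_Garray, and d_info can also be allocated once outside the loop with the maximum itera, so the loop body performs no allocations at all.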
