SVD very slow when using cuSolver as compared to MATLAB


Problem description

I'm trying to use the gesvd function from cuSOLVER, but I found it to be much slower than the svd function in MATLAB, whether MATLAB runs on a double array or on a gpuArray.

C++ code [using cuSolver]:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>
// Macros for timing GPU work with CUDA events; use as a START_METER ... STOP_METER pair
#define START_METER {\
    cudaEvent_t start, stop;\
    float elapsedTime;\
    cudaEventCreate(&start);\
    cudaEventCreate(&stop);\
    cudaEventRecord(start, 0);
#define STOP_METER cudaEventRecord(stop, 0);\
    cudaEventSynchronize(stop);\
    cudaEventElapsedTime(&elapsedTime, start, stop);\
    printf("Elapsed time : %f ms\n", elapsedTime);\
    cudaEventDestroy(start);\
    cudaEventDestroy(stop);\
    }

void cusolverSVD_Test()
{
    const int m = 64;
    const int rows = m;
    const int cols = m;
    // Fill A with uniform random values in [0, 1] and add 1 to each diagonal entry.
    double A[rows*m];
    for (int i = 0; i < cols; i++)
    {
        for (int j = 0; j < rows; j++)
        {
            A[i*rows + j] = (double)rand() / RAND_MAX;
            if (i == j){
                A[i*rows + j] += 1;
            }
        }
    }

    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);
    int lwork;

    cusolverDnDgesvd_bufferSize(
        handle,
        rows,
        cols,
        &lwork);

    double *d_A;
    cudaMalloc(&d_A, sizeof(double)*rows*cols);
    cudaMemcpy(d_A, A, sizeof(double)*rows*cols, cudaMemcpyHostToDevice);

    double *d_S;
    cudaMalloc(&d_S, sizeof(double)*rows);

    double *d_U;
    cudaMalloc(&d_U, sizeof(double)*rows*rows);

    double *d_VT;
    cudaMalloc(&d_VT, sizeof(double)*rows*rows);

    double *d_work;
    cudaMalloc(&d_work, sizeof(double)*lwork);

    double *d_rwork;
    cudaMalloc(&d_rwork, sizeof(double)*(rows - 1));

    int *devInfo;
    cudaMalloc(&devInfo, sizeof(int));

    for (int t = 0; t < 10; t++)
    {
        signed char jobu = 'A';
        signed char jobvt = 'A';
        START_METER
            cusolverDnDgesvd(
            handle,
            jobu,
            jobvt,
            rows,
            cols,
            d_A,
            rows,
            d_S,
            d_U,
            rows,
            d_VT,
            rows,
            d_work,
            lwork,
            d_rwork,
            devInfo);
        STOP_METER
    }

    // Release device buffers and the cuSOLVER handle
    cudaFree(d_A);
    cudaFree(d_rwork);
    cudaFree(d_S);
    cudaFree(d_U);
    cudaFree(d_VT);
    cudaFree(d_work);
    cudaFree(devInfo);
    cusolverDnDestroy(handle);

}

int main()
{
    cusolverSVD_Test();
}

Output:

Elapsed time : 63.318016 ms
Elapsed time : 66.745316 ms
Elapsed time : 65.966530 ms
Elapsed time : 65.999939 ms
Elapsed time : 64.821053 ms
Elapsed time : 65.184547 ms
Elapsed time : 65.722916 ms
Elapsed time : 60.618786 ms
Elapsed time : 54.937569 ms
Elapsed time : 53.751263 ms

MATLAB code using the svd function:

%% SVD on CPU and GPU
A = rand(64, 64) + eye(64);
tic
[~, ~, ~] = svd(A);
t = toc;
fprintf('CPU time: %f ms\n', t*1000);


d_A = gpuArray(A);
tic
[~, ~, ~] = svd(d_A);
t = toc;
fprintf('GPU time: %f ms\n', t*1000);

%% Output
% >> CPU time: 0.947754 ms
% >> GPU time: 2.168100 ms

Does MATLAB use a faster algorithm, or am I making a mistake? I really need a good SVD implementation/algorithm that I can use in CUDA.

Update: execution times when using a 1000 x 1000 matrix

C++:

3655 ms (Double Precision)
2970 ms (Single Precision)

MATLAB:

CPU time: 280.641123 ms
GPU time: 646.033498 ms

Recommended answer

It is a known issue that the SVD algorithm does not parallelize well. You will find that you need very large arrays to see a benefit in double precision. You may get better results for single precision for your GPU. You will also get better results if you only request one output, since computing the singular values alone uses a much faster algorithm.

This is also highly dependent on the quality of your GPU. If you are using a graphics card such as a GeForce GTX, you really aren't going to see much benefit from the GPU in double precision for an algorithm like SVD.

Fundamentally, individual GPU cores have much lower performance than modern CPU cores, and they make up for this with very wide parallelism. The SVD algorithm depends too heavily on a serial factorization iteration. Perhaps you can solve your problem by rethinking the algebra so you don't need to compute the complete factorization every time.
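
As one illustration of "rethinking the algebra": if the application only needs the largest singular value, a full SVD may be unnecessary. The sketch below is an assumption about what such a reformulation could look like, not something the answer prescribes. It uses plain power iteration on A^T A, which needs only matrix-vector products (operations that do parallelize well). The function name `largest_singular_value`, the row-major layout, and the iteration limits are all hypothetical choices made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper (not part of cuSOLVER or MATLAB): estimates the largest
// singular value of a row-major m x n matrix by power iteration on A^T A.
// Assumes the start vector is not orthogonal to the dominant right singular vector.
double largest_singular_value(const std::vector<double>& A, int m, int n,
                              int max_iters = 1000, double tol = 1e-12)
{
    assert(m > 0 && n > 0 && static_cast<int>(A.size()) == m * n);

    // Unit-length start vector
    std::vector<double> v(n, 1.0 / std::sqrt(static_cast<double>(n)));
    std::vector<double> Av(m), w(n);
    double sigma = 0.0;

    for (int it = 0; it < max_iters; ++it) {
        // Av = A * v
        for (int i = 0; i < m; ++i) {
            double s = 0.0;
            for (int j = 0; j < n; ++j) s += A[i * n + j] * v[j];
            Av[i] = s;
        }
        // w = A^T * Av
        for (int j = 0; j < n; ++j) {
            double s = 0.0;
            for (int i = 0; i < m; ++i) s += A[i * n + j] * Av[i];
            w[j] = s;
        }
        // ||A^T A v|| approaches sigma_max^2 as v converges
        double norm = 0.0;
        for (double x : w) norm += x * x;
        norm = std::sqrt(norm);
        if (norm == 0.0) return 0.0;  // v lies in the null space of A

        for (int j = 0; j < n; ++j) v[j] = w[j] / norm;

        double next = std::sqrt(norm);
        bool converged = std::fabs(next - sigma) <= tol * next;
        sigma = next;
        if (converged) break;
    }
    return sigma;
}
```

For example, calling it on the 3 x 3 symmetric matrix with rows (3.5, 0.5, 0), (0.5, 3.5, 0), (0, 0, 2), whose singular values are 4, 3 and 2, should return a value very close to 4. Convergence slows down when the two largest singular values are close together, so this is only worthwhile when the dominant value is well separated and the full factorization is genuinely not needed.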
