与MATLAB相比,使用cuSolver时SVD非常慢 [英] SVD very slow when using cuSolver as compared to MATLAB
问题描述
我正在尝试使用cuSOLVER
中的gesvd
函数,在使用double
数组或gpuArray
的两种情况下,我发现它都比MATLAB中的svd
函数要慢得多.
I'm trying to use the gesvd
function from cuSOLVER
which I found to be much slower than the svd
function in MATLAB, for both cases using double
array or gpuArray
.
C ++代码[使用cuSolver
] :
C++ code [using cuSolver
]:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>
// Macro for timing kernel runs
#define START_METER {\
cudaEvent_t start, stop;\
float elapsedTime;\
cudaEventCreate(&start);\
cudaEventRecord(start, 0);
#define STOP_METER cudaEventCreate(&stop);\
cudaEventRecord(stop, 0);\
cudaEventSynchronize(stop);\
cudaEventElapsedTime(&elapsedTime, start, stop);\
printf("Elapsed time : %f ms\n", elapsedTime);\
}
void cusolverSVD_Test()
{
const int m = 64;
const int rows = m;
const int cols = m;
/* | 3.5 0.5 0 |
* A = | 0.5 3.5 0 |
* | 0 0 2 |
*
*/
double A[rows*m];
for (int i = 0; i < cols; i++)
{
for (int j = 0; j < rows; j++)
{
A[i*rows + j] = (double)rand() / RAND_MAX;
if (i == j){
A[i*rows + j] += 1;
}
}
}
cusolverDnHandle_t handle;
cusolverDnCreate(&handle);
int lwork;
cusolverDnDgesvd_bufferSize(
handle,
rows,
cols,
&lwork);
double *d_A;
cudaMalloc(&d_A, sizeof(double)*rows*cols);
cudaMemcpy(d_A, A, sizeof(double)*rows*cols, cudaMemcpyHostToDevice);
double *d_S;
cudaMalloc(&d_S, sizeof(double)*rows);
double *d_U;
cudaMalloc(&d_U, sizeof(double)*rows*rows);
double *d_VT;
cudaMalloc(&d_VT, sizeof(double)*rows*rows);
double *d_work;
cudaMalloc(&d_work, sizeof(double)*lwork);
double *d_rwork;
cudaMalloc(&d_rwork, sizeof(double)*(rows - 1));
int *devInfo;
cudaMalloc(&devInfo, sizeof(int));
for (int t = 0; t < 10; t++)
{
signed char jobu = 'A';
signed char jobvt = 'A';
START_METER
cusolverDnDgesvd(
handle,
jobu,
jobvt,
rows,
cols,
d_A,
rows,
d_S,
d_U,
rows,
d_VT,
rows,
d_work,
lwork,
d_rwork,
devInfo);
STOP_METER
}
cudaFree(d_A);
cudaFree(d_rwork);
cudaFree(d_S);
cudaFree(d_U);
cudaFree(d_VT);
cudaFree(d_work);
}
int main()
{
cusolverSVD_Test();
}
输出:
Elapsed time : 63.318016 ms
Elapsed time : 66.745316 ms
Elapsed time : 65.966530 ms
Elapsed time : 65.999939 ms
Elapsed time : 64.821053 ms
Elapsed time : 65.184547 ms
Elapsed time : 65.722916 ms
Elapsed time : 60.618786 ms
Elapsed time : 54.937569 ms
Elapsed time : 53.751263 ms
Press any key to continue . . .
**使用svd
函数*的Matlab代码:
**Matlab code using the svd
function*:
%% SVD on gpu
A = rand(64, 64) + eye(64);
tic
[~, ~, ~] = svd(A);
t = toc;
fprintf('CPU time: %f ms\n', t*1000);
d_A = gpuArray(A);
tic
[~, ~, ~] = svd(d_A);
t = toc;
fprintf('GPU time: %f ms\n', t*1000);
%% Output
% >> CPU time: 0.947754 ms
% >> GPU time: 2.168100 ms
Matlab是否使用一些更快的算法?还是我只是在犯一些错误?我确实需要可以在CUDA
中使用的SVD良好的实现/算法.
Does Matlab use some faster algorithm? Or am I just doing some mistakes? I really need a good implementation/algorithm for SVD that I can use in CUDA
.
更新:使用1000 x 1000矩阵时的执行时间
C ++ :
3655 ms (Double Precision)
2970 ms (Single Precision)
Matlab :
CPU time: 280.641123 ms
GPU time: 646.033498 ms
推荐答案
已知的问题是SVD算法不能很好地并行化.您会发现需要非常大的数组才能看到双精度的好处.对于GPU,单精度可能会获得更好的结果.如果仅请求一个输出,也会获得更好的结果,因为仅计算奇异值会使用更快的算法.
It is a known issue that the SVD algorithm does not parallelize well. You will find that you need very large arrays to see a benefit in double precision. You may get better results for single precision for your GPU. You will also get better results if you only request one output, since computing the singular values alone uses a much faster algorithm.
这在很大程度上还取决于您GPU的质量.如果您使用诸如GeForce GTX之类的图形卡,那么对于像SVD这样的算法而言,双精度的GPU确实不会带来太大好处.
This is also highly dependent on the quality of your GPU. If you are using a graphics card such as GeForce GTX you really aren't going to see much benefit for a GPU in double precision for an algorithm like SVD.
从根本上说,GPU内核的性能要比现代CPU内核低得多,并且它们以非常宽的并行度弥补了这一点. SVD算法过于依赖串行分解迭代.也许您可以通过重新考虑代数来解决问题,这样就不必每次都计算出完整的因式分解.
Fundamentally, GPU cores have a much lower performance than modern CPU cores, and they make up for this with very wide parallelism. The SVD algorithm is too highly dependent on a serial factorization iteration. Perhaps you can solve your problem by rethinking the algebra so you don't need to compute the complete factorization every time.
这篇关于与MATLAB相比,使用cuSolver时SVD非常慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!