CUFFT is 1000x slower in VS2013/Cuda7.0 compared to VS2010/Cuda4.2

Question

This simple CUFFT code was run in two IDEs:


  1. VS 2013 with Cuda 7.0

  2. VS 2010 with Cuda 4.2

I found that VS 2013 with Cuda 7.0 was approximately 1000 times slower. The code executed in 0.6 ms in VS 2010 and took 520 ms in VS 2013, both on average.

#include "stdafx.h"
#include "cuda.h"
#include "cuda_runtime_api.h"
#include "cufft.h"
typedef cuComplex Complex;
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    const int SIZE = 10000;
    Complex *h_col = (Complex*)malloc(SIZE*sizeof(Complex));
    for (int i = 0; i < SIZE; i++)
    {
        h_col[i].x = i;
        h_col[i].y = i;
    }
    Complex *d_col;
    cudaMalloc((void**)&d_col, SIZE*sizeof(Complex));
    cudaMemcpy(d_col, h_col, SIZE*sizeof(Complex), cudaMemcpyHostToDevice);

    cufftHandle plan;
    const int BATCH = 1;
    cufftPlan1d(&plan, SIZE, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, d_col, d_col, CUFFT_FORWARD);

    cudaMemcpy(h_col, d_col, SIZE*sizeof(Complex), cudaMemcpyDeviceToHost);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    cufftDestroy(plan);
    cout << milliseconds;

    return 0;
}

The code was run on the same computer, with the same OS and the same graphics card, one run immediately after the other. The configuration in both cases was x64 Release. You get to choose whether to compile the file using the C++ compiler or CUDA C/C++. I tried both options on both projects and it made no difference.

Any ideas to fix this?

FWIW, I get the same results with Cuda 6.5 on VS 2013 as with Cuda 7.

Recommended answer

The cufft library has grown considerably from 4.2 to 7.0, and this results in substantially more initialization time. If you remove this initialization time as a factor, I think you will find far less than a 1000x difference in execution time.

Here's a modified version of the code demonstrating this:

$ cat t807.cu
#include <cufft.h>
#include <cuComplex.h>
typedef cuComplex Complex;
#include <iostream>
using namespace std;
int main(int argc, char* argv[])
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    const int SIZE = 10000;
    Complex *h_col = (Complex*)malloc(SIZE*sizeof(Complex));
    for (int i = 0; i < SIZE; i++)
    {
        h_col[i].x = i;
        h_col[i].y = i;
    }
    Complex *d_col;
    cudaMalloc((void**)&d_col, SIZE*sizeof(Complex));
    cudaMemcpy(d_col, h_col, SIZE*sizeof(Complex), cudaMemcpyHostToDevice);

    cufftHandle plan;
    const int BATCH = 1;
    cufftPlan1d(&plan, SIZE, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, d_col, d_col, CUFFT_FORWARD);

    cudaMemcpy(h_col, d_col, SIZE*sizeof(Complex), cudaMemcpyDeviceToHost);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    cufftDestroy(plan);
    cout << milliseconds << endl;

    cudaEventRecord(start);
    for (int i = 0; i < SIZE; i++)
    {
        h_col[i].x = i;
        h_col[i].y = i;
    }
    cudaMemcpy(d_col, h_col, SIZE*sizeof(Complex), cudaMemcpyHostToDevice);

    cufftPlan1d(&plan, SIZE, CUFFT_C2C, BATCH);
    cufftExecC2C(plan, d_col, d_col, CUFFT_FORWARD);

    cudaMemcpy(h_col, d_col, SIZE*sizeof(Complex), cudaMemcpyDeviceToHost);

    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    cufftDestroy(plan);
    cout << milliseconds << endl;

    return 0;
}
$ nvcc -o t807 t807.cu -lcufft
$ ./t807
94.8298
1.44778
$

The second number above represents essentially the same code with the cufft initialization removed (since it was done on the first pass).
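To take the idea one step further, the one-time costs can be moved out of the timed region entirely. The following is only a sketch under some assumptions: the helper name `steady_state_fft_ms` and its `iterations` parameter are made up for illustration, the transform is done out of place so the input stays unchanged between repetitions, and `cudaFree(0)` is used as a common way to force CUDA context creation before timing starts. It creates the plan once, runs one untimed warm-up transform, and then averages the time of repeated `cufftExecC2C` calls.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper (not part of the original post): returns the average time
// in milliseconds of one forward C2C FFT of length n, with context creation,
// library initialization, plan creation and memory transfers kept outside the
// timed region. Returns -1.0f if any CUDA or cuFFT call fails.
float steady_state_fft_ms(int n, int iterations)
{
    if (n <= 0 || iterations <= 0) return -1.0f;

    std::vector<cufftComplex> h_in(n);
    for (int i = 0; i < n; i++) { h_in[i].x = (float)i; h_in[i].y = (float)i; }
    const size_t bytes = n * sizeof(cufftComplex);

    cufftComplex *d_in = NULL, *d_out = NULL;
    cufftHandle plan = 0;
    cudaEvent_t start = NULL, stop = NULL;
    float total_ms = 0.0f;

    // One-time costs, all untimed: context creation, allocation, upload,
    // plan creation, and a warm-up transform.
    bool ok = cudaFree(0) == cudaSuccess;
    ok = ok && cudaMalloc((void**)&d_in, bytes) == cudaSuccess;
    ok = ok && cudaMalloc((void**)&d_out, bytes) == cudaSuccess;
    ok = ok && cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    const bool have_plan = ok && cufftPlan1d(&plan, n, CUFFT_C2C, 1) == CUFFT_SUCCESS;
    ok = have_plan;
    ok = ok && cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD) == CUFFT_SUCCESS;
    ok = ok && cudaDeviceSynchronize() == cudaSuccess;

    // Timed region: only the repeated FFT executions.
    ok = ok && cudaEventCreate(&start) == cudaSuccess;
    ok = ok && cudaEventCreate(&stop) == cudaSuccess;
    ok = ok && cudaEventRecord(start) == cudaSuccess;
    for (int i = 0; ok && i < iterations; i++)
        ok = cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD) == CUFFT_SUCCESS;
    ok = ok && cudaEventRecord(stop) == cudaSuccess;
    ok = ok && cudaEventSynchronize(stop) == cudaSuccess;
    ok = ok && cudaEventElapsedTime(&total_ms, start, stop) == cudaSuccess;

    // Release whatever was successfully created.
    if (stop) cudaEventDestroy(stop);
    if (start) cudaEventDestroy(start);
    if (have_plan) cufftDestroy(plan);
    cudaFree(d_out);  // freeing a null pointer is a no-op
    cudaFree(d_in);

    return ok ? total_ms / iterations : -1.0f;
}
```

Calling something like `steady_state_fft_ms(10000, 100)` from `main` and printing the result (built with `nvcc` and `-lcufft`, as above) should give a per-transform figure that is more comparable across toolkit versions than a single cold run, since it leaves out the initialization that dominates the first pass.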
