CUDA: Why is Thrust so slow when uploading data to the GPU?


Question


I'm new to the GPU world and just installed CUDA to write some programs. I played with the Thrust library but found that it is very slow when uploading data to the GPU: only about 35 MB/s for the host-to-device part on my not-bad desktop. How come?


Environment: Visual Studio 2012, CUDA 5.0, GTX760, Intel-i7, Windows 7 x64

GPU bandwidth test:


It is supposed to get at least 11 GB/s of transfer speed for host-to-device (or vice versa), but it doesn't!

Here is the test program:

#include <iostream>
#include <ctime>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

#define N 32<<22

int main(void)
{
    using namespace std;

    cout<<"GPU bandwidth test via thrust, data size: "<< (sizeof(double)*N) / 1000000000.0 <<" Gbytes"<<endl;
    cout<<"============program start=========="<<endl;

    int now = time(0);
    cout<<"Initializing h_vec...";
    thrust::host_vector<double> h_vec(N,0.0f);
    cout<<"time spent: "<<time(0)-now<<"secs"<<endl;

    now = time(0);
    cout<<"Uploading data to GPU...";
    thrust::device_vector<double> d_vec = h_vec;
    cout<<"time spent: "<<time(0)-now<<"secs"<<endl;

    now = time(0);
    cout<<"Downloading data to h_vec...";
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    cout<<"time spent: "<<time(0)-now<<"secs"<<endl<<endl;

    system("PAUSE");
    return 0;
}

Program output:


  • Download speed: less than 1 sec, which pretty much makes sense compared to the nominal 11 GB/s.


  • Upload speed: 1.07374 GB / 32 secs, which works out to about 33.5 MB/s; that doesn't make sense at all.


Does anyone know the reason? Or is it just the way thrust is?

Thanks!

Answer


Your comparison has several flaws, some of which are covered in the comments.

  1. You need to eliminate any allocation effects. You can do this by doing some "warm-up" transfers first.
  2. You need to eliminate any "start-up" effects. You can do this by doing some "warm-up" transfers first.
  3. When comparing the data, remember that bandwidthTest uses a PINNED memory allocation, which Thrust does not use, so the Thrust data transfer rate will be slower. This typically contributes about a 2x factor (i.e. pinned memory transfers are typically about 2x faster than pageable memory transfers). If you want a better comparison with bandwidthTest, run it with the --memory=pageable switch. (A pinned-memory sketch follows this list.)
  4. Your choice of timing functions might not be the best. cudaEvents is pretty reliable for timing CUDA operations.
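
Regarding point 3, below is a minimal sketch (an illustration, not part of the original answer) of what the host-to-device leg might look like with a pinned (page-locked) host buffer. It assumes a raw cudaMallocHost buffer in place of the pageable host_vector; some Thrust versions also ship an experimental pinned host allocator, but the raw-buffer route is the least version-dependent. The DSIZE constant and the cudaEvent timing pattern mirror the answer's program further down.

#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <cuda_runtime.h>

#define DSIZE ((1UL<<20)*32)

int main(){

  thrust::device_vector<int> d_data(DSIZE);

  // page-locked (pinned) host buffer instead of a pageable host_vector
  int *h_pinned = 0;
  cudaMallocHost((void **)&h_pinned, DSIZE*sizeof(int));
  for (size_t i = 0; i < DSIZE; i++) h_pinned[i] = 1;

  float et;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  // warm-up transfer to absorb allocation/start-up effects (points 1 and 2)
  thrust::copy(h_pinned, h_pinned + DSIZE, d_data.begin());

  // timed host-to-device copy from the pinned buffer
  cudaEventRecord(start);
  thrust::copy(h_pinned, h_pinned + DSIZE, d_data.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  std::cout << "pinned host to device apparent bandwidth: "
            << ((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576) << " MB/s" << std::endl;

  cudaFreeHost(h_pinned);
  return 0;
}

With the host buffer pinned, the measured figure should land near the machine's pinned bandwidthTest number (about 6 GB/s on the answerer's PCIe Gen2 system) rather than the ~2.5 GB/s pageable figure reported below.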


Here is code that does proper timing:

$ cat t213.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/fill.h>

#define DSIZE ((1UL<<20)*32)

int main(){

  thrust::device_vector<int> d_data(DSIZE);
  thrust::host_vector<int> h_data(DSIZE);
  float et;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  thrust::fill(h_data.begin(), h_data.end(), 1);
  thrust::copy(h_data.begin(), h_data.end(), d_data.begin());

  std::cout<< "warm up iteration " << d_data[0] << std::endl;
  thrust::fill(d_data.begin(), d_data.end(), 2);
  thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
  std::cout<< "warm up iteration " << h_data[0] << std::endl;
  thrust::fill(h_data.begin(), h_data.end(), 3);
  cudaEventRecord(start);
  thrust::copy(h_data.begin(), h_data.end(), d_data.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  std::cout<<"host to device iteration " << d_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
  std::cout<<"apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;
  thrust::fill(d_data.begin(), d_data.end(), 4);
  cudaEventRecord(start);
  thrust::copy(d_data.begin(), d_data.end(), h_data.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  cudaEventElapsedTime(&et, start, stop);
  std::cout<<"device to host iteration " << h_data[0] << " elapsed time: " << (et/(float)1000) << std::endl;
  std::cout<<"apparent bandwidth: " << (((DSIZE*sizeof(int))/(et/(float)1000))/((float)1048576)) << " MB/s" << std::endl;

  std::cout << "finished" << std::endl;
  return 0;
}


I compile with the following (I have a PCIe Gen2 system with a cc2.0 device):

$ nvcc -O3 -arch=sm_20 -o t213 t213.cu
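
As a side note, the GTX760 in the question is a compute capability 3.0 device, so on that machine the corresponding invocation would presumably be:

$ nvcc -O3 -arch=sm_30 -o t213 t213.cu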


When I run it I get the following results:

$ ./t213
warm up iteration 1
warm up iteration 2
host to device iteration 3 elapsed time: 0.0476644
apparent bandwidth: 2685.44 MB/s
device to host iteration 4 elapsed time: 0.0500736
apparent bandwidth: 2556.24 MB/s
finished
$


This looks correct to me because a bandwidthTest on my system would report about 6GB/s in either direction as I have a PCIE Gen2 system. Since thrust uses pageable, not pinned memory, I get about half that bandwidth, i.e. 3GB/s, and thrust is reporting about 2.5GB/s.


For comparison, here is the bandwidth test on my system, using pageable memory:

$ /usr/local/cuda/samples/bin/linux/release/bandwidthTest --memory=pageable
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Quadro 5000
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2718.2

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     2428.2

 Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     99219.1

$
