CUDA:内存复制到GPU 1在多GPU中速度较慢 [英] CUDA: Memory copy to GPU 1 is slower in multi-GPU
问题描述
我的公司设置了两个GTX 295,因此在一个服务器中总共有4个GPU,我们有多个服务器。
我们GPU 1具体是慢的,与GPU 0,2和3相比,所以我写了一个小的速度测试,以帮助找到问题的原因。
My company has a setup of two GTX 295, so a total of 4 GPUs in a server, and we have several servers. We GPU 1 specifically was slow, in comparison to GPU 0, 2 and 3 so I wrote a little speed test to help find the cause of the problem.
//#include <stdio.h>
//#include <stdlib.h>
//#include <cuda_runtime.h>
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <cutil.h>
__global__ void test_kernel(float *d_data) {
int tid = blockDim.x*blockIdx.x + threadIdx.x;
for (int i=0;i<10000;++i) {
d_data[tid] = float(i*2.2);
d_data[tid] += 3.3;
}
}
int main(int argc, char* argv[])
{
int deviceCount;
cudaGetDeviceCount(&deviceCount);
int device = 0; //SELECT GPU HERE
cudaSetDevice(device);
cudaEvent_t start, stop;
unsigned int num_vals = 200000000;
float *h_data = new float[num_vals];
for (int i=0;i<num_vals;++i) {
h_data[i] = float(i);
}
float *d_data = NULL;
float malloc_timer;
cudaEventCreate(&start);
cudaEventCreate(&stop); cudaEventRecord( start, 0 );
cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice);
cudaMalloc((void**)&d_data, sizeof(float)*num_vals);
cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &malloc_timer, start, stop );
cudaEventDestroy( start );
cudaEventDestroy( stop );
float mem_timer;
cudaEventCreate(&start);
cudaEventCreate(&stop); cudaEventRecord( start, 0 );
cudaMemcpy(d_data, h_data, sizeof(float)*num_vals,cudaMemcpyHostToDevice);
cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &mem_timer, start, stop );
cudaEventDestroy( start );
cudaEventDestroy( stop );
float kernel_timer;
cudaEventCreate(&start);
cudaEventCreate(&stop); cudaEventRecord( start, 0 );
test_kernel<<<1000,256>>>(d_data);
cudaEventRecord( stop, 0 ); cudaEventSynchronize( stop ); cudaEventElapsedTime( &kernel_timer, start, stop );
cudaEventDestroy( start );
cudaEventDestroy( stop );
printf("cudaMalloc took %f ms\n",malloc_timer);
printf("Copy to the GPU took %f ms\n",mem_timer);
printf("Test Kernel took %f ms\n",kernel_timer);
cudaMemcpy(h_data,d_data, sizeof(float)*num_vals,cudaMemcpyDeviceToHost);
delete[] h_data;
return 0;
}
结果是
GPU0
cudaMalloc花了0.908640 ms
复制到GPU花了296.058777 ms
测试内核花了326.721283 ms
GPU0 cudaMalloc took 0.908640 ms Copy to the GPU took 296.058777 ms Test Kernel took 326.721283 ms
GPU1
cudaMalloc花了0.913568 ms
复制到GPU的时间为 663.182251 ms
测试内核耗时326.710785 ms
GPU1 cudaMalloc took 0.913568 ms Copy to the GPU took 663.182251 ms Test Kernel took 326.710785 ms
GPU2
cudaMalloc takes 0.925600 ms
复制到GPU花了296.915039 ms
测试内核花了327.127930 ms
GPU2 cudaMalloc took 0.925600 ms Copy to the GPU took 296.915039 ms Test Kernel took 327.127930 ms
GPU3
cudaMalloc花了0.920416 ms
复制到GPU花了296.968384 ms
测试内核花了327.038696 ms
GPU3 cudaMalloc took 0.920416 ms Copy to the GPU took 296.968384 ms Test Kernel took 327.038696 ms
正如你所看到的,cudaMemcpy到GPU是双倍的的时间。这在我们的所有服务器之间是一致的,它总是GPU1是慢的。
任何想法为什么这可能是?
所有服务器都运行Windows XP。
As you can see, the cudaMemcpy to the GPU is well double the amount of time for GPU1. This is consistent between all our servers, it is always GPU1 that is slow. Any ideas why this may be? All servers are running windows XP.
推荐答案
这是一个驱动程序问题。更新到最新的驱动程序修复它
This was a driver issue. Updating to the latest driver fixed it
这篇关于CUDA:内存复制到GPU 1在多GPU中速度较慢的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!