Multi-GPU Basic Usage


Question

How can I use two devices to improve, for example, the performance of the following code (a sum of vectors)? Is it possible to use more devices "at the same time"? If so, how can I manage the allocation of the vectors in the global memory of the different devices?

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <cuda.h>

#define NB 32
#define NT 500
#define N NB*NT

__global__ void add( double *a, double *b, double *c);

//===========================================
__global__ void add( double *a, double *b, double *c){

    int tid = threadIdx.x + blockIdx.x * blockDim.x; 

    while(tid < N){
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }

}

//============================================
//BEGIN
//===========================================
int main( void ) {

    double *a, *b, *c;
    double *dev_a, *dev_b, *dev_c;

    // allocate the memory on the CPU
    a=(double *)malloc(N*sizeof(double));
    b=(double *)malloc(N*sizeof(double));
    c=(double *)malloc(N*sizeof(double));

    // allocate the memory on the GPU
    cudaMalloc( (void**)&dev_a, N * sizeof(double) );
    cudaMalloc( (void**)&dev_b, N * sizeof(double) );
    cudaMalloc( (void**)&dev_c, N * sizeof(double) );

    // fill the arrays 'a' and 'b' on the CPU
    for (int i=0; i<N; i++) {
        a[i] = (double)i;
        b[i] = (double)i*2;
    }

    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy( dev_a, a, N * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy( dev_b, b, N * sizeof(double), cudaMemcpyHostToDevice);

    for(int i=0;i<10000;++i)
        add<<<NB,NT>>>( dev_a, dev_b, dev_c );

    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy( c, dev_c, N * sizeof(double), cudaMemcpyDeviceToHost);

    // display the results
    // for (int i=0; i<N; i++) {
    //     printf( "%g + %g = %g\n", a[i], b[i], c[i] );
    // }
    printf("\nGPU done\n");

    // free the memory allocated on the GPU
    cudaFree( dev_a );
    cudaFree( dev_b );
    cudaFree( dev_c );
    // free the memory allocated on the CPU
    free( a );
    free( b );
    free( c );

    return 0;
}

Thanks in advance, Michele

Answer

Since CUDA 4.0 was released, multi-GPU computations of the type you are asking about are relatively easy. Prior to that, you would have needed to use a multi-threaded host application with one host thread per GPU and some sort of inter-thread communication system in order to use multiple GPUs inside the same host application.

Now it is possible to do something like this for the memory allocation part of your host code:

double *dev_a[2], *dev_b[2], *dev_c[2];
const int Ns[2] = {N/2, N-(N/2)};

// allocate the memory on the GPUs
for(int dev=0; dev<2; dev++) {
    cudaSetDevice(dev);
    cudaMalloc( (void**)&dev_a[dev], Ns[dev] * sizeof(double) );
    cudaMalloc( (void**)&dev_b[dev], Ns[dev] * sizeof(double) );
    cudaMalloc( (void**)&dev_c[dev], Ns[dev] * sizeof(double) );
}

(disclaimer: written in browser, never compiled, never tested, use at own risk).
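One thing worth adding (not part of the original answer): every CUDA runtime call above returns an error code, and in multi-GPU code it is easy to pass an invalid device index to cudaSetDevice, so wrapping the calls in a checking macro is a common habit. A minimal sketch, assuming the standard C headers are included:

// illustrative error-checking wrapper, not from the original answer
#define CUDA_CHECK(call) do {                                      \
    cudaError_t err = (call);                                      \
    if (err != cudaSuccess) {                                      \
        fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                cudaGetErrorString(err), __FILE__, __LINE__);      \
        exit(EXIT_FAILURE);                                        \
    }                                                              \
} while (0)

// usage: CUDA_CHECK( cudaSetDevice(dev) );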

The basic idea here is that you use cudaSetDevice to select between devices when you are performing operations on a device. So in the above snippet, I have assumed two GPUs and allocated memory on each [N/2 doubles on the first device and N-(N/2) on the second].

The transfer of data from the host to device could be as simple as:

// copy the arrays 'a' and 'b' to the GPUs
for(int dev=0,pos=0; dev<2; pos+=Ns[dev], dev++) {
    cudaSetDevice(dev);
    cudaMemcpy( dev_a[dev], a+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy( dev_b[dev], b+pos, Ns[dev] * sizeof(double), cudaMemcpyHostToDevice);
}

(disclaimer: written in browser, never compiled, never tested, use at own risk).

The kernel launching section of your code could then look something like:

for(int i=0;i<10000;++i) {
    for(int dev=0; dev<2; dev++) {
        cudaSetDevice(dev);
        add<<<NB,NT>>>( dev_a[dev], dev_b[dev], dev_c[dev], Ns[dev] );
    }
}

(disclaimer: written in browser, never compiled, never tested, use at own risk).
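The answer stops at the kernel launches; gathering the results follows the same pattern as the upload, with each device copying its slice of c back to the matching offset in the host array. A sketch under the same caveats (written in browser, untested):

// copy the partial results back from each GPU
for(int dev=0,pos=0; dev<2; pos+=Ns[dev], dev++) {
    cudaSetDevice(dev);
    cudaMemcpy( c+pos, dev_c[dev], Ns[dev] * sizeof(double), cudaMemcpyDeviceToHost);
}

Since cudaMemcpy blocks the host until the copy (and the kernels queued before it on that device) have finished, no explicit synchronization call is needed before reading c.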

Note that I have added an extra argument to your kernel call, because each instance of the kernel may be called with a different number of array elements to process. I will leave it to you to work out the modifications required. But, again, the basic idea is the same: use cudaSetDevice to select a given GPU, then run kernels on it in the normal way, with each kernel getting its own unique arguments.
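The modification hinted at here is small: the kernel's hard-coded N must become the per-device element count passed in as the extra argument. One possible version (a sketch, not shown in the original answer):

__global__ void add( double *a, double *b, double *c, int n){

    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // grid-stride loop over this device's n elements only
    while(tid < n){
        c[tid] = a[tid] + b[tid];
        tid += blockDim.x * gridDim.x;
    }

}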

You should be able to put these parts together to produce a simple multi-GPU application. There are a lot of other features in recent CUDA versions and hardware that can assist multi-GPU applications (such as unified addressing, the peer-to-peer facilities, and more), but this should be enough to get you started. There is also a simple multi-GPU application in the CUDA SDK that you can look at for more ideas.
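As a final generalization (an assumption beyond the original answer, which fixes the count at two), the number of available GPUs can be discovered at run time with cudaGetDeviceCount, so the dev_a/dev_b/dev_c and Ns arrays can be sized dynamically instead of being hard-coded. A sketch, untested:

int deviceCount = 0;
cudaGetDeviceCount(&deviceCount);
printf("found %d CUDA device(s)\n", deviceCount);
// size dev_a/dev_b/dev_c and Ns from deviceCount,
// and loop for(int dev=0; dev<deviceCount; dev++) { ... }
// instead of assuming exactly two GPUs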
