Difference on creating a CUDA context


Question

I have a program that uses three kernels. In order to get the speedups, I was doing a dummy memory copy to create a context as follows:

__global__ void warmStart(int* f)
{
    *f = 0;
}

which is launched before the kernels I want to time, as follows:

int *dFlag = NULL;
cudaMalloc( (void**)&dFlag, sizeof(int) );
warmStart<<<1, 1>>>(dFlag);
Check_CUDA_Error("warmStart kernel");
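Put together, a minimal self-contained sketch of this warm-up pattern might look as follows. The cudaEvent-based timing and the use of cudaDeviceSynchronize() in place of the Check_CUDA_Error helper are assumptions for illustration, not part of the original program:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Dummy kernel used only to force full context initialization.
__global__ void warmStart(int* f)
{
    *f = 0;
}

int main()
{
    int *dFlag = NULL;
    cudaMalloc((void**)&dFlag, sizeof(int));

    // Warm-up launch: triggers the deferred per-context allocations.
    warmStart<<<1, 1>>>(dFlag);
    cudaDeviceSynchronize();

    // Time a subsequent launch with CUDA events (here the same dummy
    // kernel stands in for the three real kernels being measured).
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    warmStart<<<1, 1>>>(dFlag);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dFlag);
    return 0;
}
```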

I also read about other, simpler ways to create a context, such as cudaFree(0) or cudaDeviceSynchronize(). But using these API calls gives worse times than using the dummy kernel.

The execution times of the program, after forcing the context, are 0.000031 seconds for the dummy kernel and 0.000064 seconds for both cudaDeviceSynchronize() and cudaFree(0). The times were obtained as the mean of 10 individual executions of the program.

Therefore, the conclusion I've reached is that launching a kernel initializes something that is not initialized when creating a context in the canonical way.

So, what's the difference between these two ways of creating a context, using a kernel and using an API call?

I ran the test on a GTX480, using CUDA 4.0 under Linux.

Answer

Each CUDA context has memory allocations that are required to execute a kernel but that are not required in order to synchronize, allocate memory, or free memory. The initial allocation of the context memory and the resizing of these allocations are deferred until a kernel requires these resources. Examples of these allocations include the local memory buffer, the device heap, and the printf heap.
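Some of these deferred per-context resources have user-visible size limits. A hedged sketch of how one might observe them, assuming the cudaDeviceGetLimit API introduced in CUDA 4.0 (the actual buffers are still only materialized on the first kernel launch, which is why a dummy kernel is a more complete warm-up than cudaFree(0) alone):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel whose first launch triggers the deferred allocations.
__global__ void empty() {}

int main()
{
    // The context itself is created lazily by the first runtime call.
    cudaFree(0);

    // Query the per-context limits backing the allocations the answer
    // mentions: local memory (stack), printf FIFO, and device malloc heap.
    size_t stack = 0, printfFifo = 0, mallocHeap = 0;
    cudaDeviceGetLimit(&stack,      cudaLimitStackSize);
    cudaDeviceGetLimit(&printfFifo, cudaLimitPrintfFifoSize);
    cudaDeviceGetLimit(&mallocHeap, cudaLimitMallocHeapSize);
    printf("stack=%zu printf=%zu heap=%zu\n", stack, printfFifo, mallocHeap);

    // First launch: the context-level buffers are allocated here.
    empty<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```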
