cudaDeviceReset为多个gpu的 [英] cudaDeviceReset for multiple gpu's
问题描述
我目前正在使用一个gpu服务器,它有4 Tesla T10 gpu的。虽然我不断测试内核,并经常使用ctrl-C杀死进程,我在一个简单的设备查询代码的末尾添加了几行。代码如下:
I am currently working on a gpu server which has 4 Tesla T10 gpu's. While I keep testing the kernels and have to frequently kill the processes using ctrl-C, I added a few lines to the end of a simple device query code. The code is given below :
#include <stdio.h>
// Print device properties
void printDevProp(cudaDeviceProp devProp)
{
printf("Major revision number: %d\n", devProp.major);
printf("Minor revision number: %d\n", devProp.minor);
printf("Name: %s\n", devProp.name);
printf("Total global memory: %u\n", devProp.totalGlobalMem);
printf("Total shared memory per block: %u\n", devProp.sharedMemPerBlock);
printf("Total registers per block: %d\n", devProp.regsPerBlock);
printf("Warp size: %d\n", devProp.warpSize);
printf("Maximum memory pitch: %u\n", devProp.memPitch);
printf("Maximum threads per block: %d\n", devProp.maxThreadsPerBlock);
for (int i = 0; i < 3; ++i)
printf("Maximum dimension %d of block: %d\n", i, devProp.maxThreadsDim[i]);
for (int i = 0; i < 3; ++i)
printf("Maximum dimension %d of grid: %d\n", i, devProp.maxGridSize[i]);
printf("Clock rate: %d\n", devProp.clockRate);
printf("Total constant memory: %u\n", devProp.totalConstMem);
printf("Texture alignment: %u\n", devProp.textureAlignment);
printf("Concurrent copy and execution: %s\n", (devProp.deviceOverlap ? "Yes" : "No"));
printf("Number of multiprocessors: %d\n", devProp.multiProcessorCount);
printf("Kernel execution timeout: %s\n", (devProp.kernelExecTimeoutEnabled ? "Yes" : "No"));
return;
}
int main()
{
// Number of CUDA devices
int devCount;
cudaGetDeviceCount(&devCount);
printf("CUDA Device Query...\n");
printf("There are %d CUDA devices.\n", devCount);
// Iterate through devices
for (int i = 0; i < devCount; ++i)
{
// Get device properties
printf("\nCUDA Device #%d\n", i);
cudaDeviceProp devProp;
cudaGetDeviceProperties(&devProp, i);
printDevProp(devProp);
}
printf("\nPress any key to exit...");
char c;
scanf("%c", &c);
**for (int i = 0; i < devCount; i++) {
cudaSetDevice(i);
cudaDeviceReset();
}**
return 0;
}
我的查询与main我设置每个设备一个一个,然后使用cudaResetDevice命令。我得到一个奇怪的感觉,这个代码,虽然不产生任何错误,但我不能重置所有的设备。相反,程序每次仅重置默认设备,即设备0。任何人都可以告诉我应该如何重置4个设备。
My query is related to the for loop just before the main() ends in which I set each device one by one and then use cudaResetDevice command. I get a strange feeling that this code, although doesnt produce any error but I am not able to reset all the devices. Instead, the program is resetting only the default device i.e device 0 each time. Can anyone tell me what should I do to reset each of the 4 devices.
感谢
推荐答案
这可能太晚了,但如果你写一个信号处理函数,你可以摆脱内存泄漏,并以一种确定的方式重置设备:
This is probably too late but if you write a signal-handler function you can get rid of the memory leaks and reset the device in a sure way:
// State variables for
extern int no_sigint;
int no_sigint = 1;
extern int interrupts;
int interrupts = 0;
/* Catches signal interrupts from Ctrl+c.
If 1 signal is detected the simulation finishes the current frame and
exits in a clean state. If Ctrl+c is pressed again it terminates the
application without completing writes to files or calculations but
deallocates all memory anyway. */
void
sigint_handler (int sig)
{
if (sig == SIGINT)
{
interrupts += 1;
std::cout << std::endl
<< "Aborting loop.. finishing frame."
<< std::endl;
no_sigint = 0;
if (interrupts >= 2)
{
std::cerr << std::endl
<< "Multiple Interrupts issued: "
<< "Clearing memory and Forcing immediate shutdown!"
<< std::endl;
// write a function to free dynamycally allocated memory
free_mem ();
int devCount;
cudaGetDeviceCount (&devCount);
for (int i = 0; i < devCount; ++i)
{
cudaSetDevice (i);
cudaDeviceReset ();
}
exit (9);
}
}
}
p>
....
int main(){
.....
for (int simulation_step=1 ; simulation_step < SIM_STEPS && no_sigint; ++simulation_step)
{
.... simulation code
}
free_mem();
... cuda device resets
return 0;
}
如果您使用此代码(您甚至可以将第一个代码段包含在外部你可以有ctrl + c的两个级别的控制:第一次按停止你的模拟并正常退出,但应用程序完成渲染的步骤是伟大的停止优雅和正确的结果,如果你按ctrl + c它再次关闭应用程序释放所有内存。
If you use this code (you can even include the first snippet in an external header, it works. You can have 2 levels of control of ctrl+c: the first press stops your simulation and exits normally but the application finishes rendering the step which is great to stop gracefully and have correct results, if you press ctrl+c again it closes the application freeing all memory.
这篇关于cudaDeviceReset为多个gpu的的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!