Concurrent kernel execution not working in AMD A10 APU

Question

I have an AMD A10 APU with a Radeon R7 GPU. I believe this device supports concurrent kernel execution. But when I wrote the following code and obtained profiling information, the kernels did not appear to execute concurrently. My OpenCL code is given below. The kernels within each iteration are added to the same queue, and the kernels in different iterations are added to different queues, so the iterations should run in parallel.

for(j = 0; j < 8; j++){
  cl_err = clEnqueueNDRangeKernel(queue[4 + j],kernel[Q6_PROGRAM_ID][FILTER1_KERNEL],1,NULL,&globalSize,&localSize,4,eventList,&eventList[4 + j * 4]); //Invoking the first filter kernel
  cl_err = clEnqueueNDRangeKernel(queue[4 + j],kernel[Q6_PROGRAM_ID][FILTER2_KERNEL],1,NULL,&globalSize,&localSize,1,eventList + 4 + 4 * j,&eventList[5 + j * 4]); //Invoking the second filter kernel
  cl_err = clEnqueueNDRangeKernel(queue[4 + j],kernel[Q6_PROGRAM_ID][FILTER3_KERNEL],1,NULL,&globalSize,&localSize,1,eventList + 5 + 4 * j,&eventList[6 + j * 4]); //Invoking the third filter kernel
  cl_err = clEnqueueNDRangeKernel(queue[4 + j],kernel[Q6_PROGRAM_ID][AGGREGATE_KERNEL],1,NULL,&globalSize,&localSize,1,eventList + 6 + 4 * j,&eventList[7 + j * 4]); //Invoking the aggregate kernel
}

The code I use for profiling is:

for(j = 0; j < 8; j++){
  //Wait for every kernel enqueued up to and including iteration j
  clWaitForEvents(4 + 4*j, eventList+4);

  //Read the profiling data of the four kernels of iteration j
  for(int k = 0; k < 4; k++){
    clGetEventProfilingInfo(eventList[4 + k + j * 4], CL_PROFILING_COMMAND_QUEUED, sizeof(time_start_queued), &time_start_queued, NULL);
    clGetEventProfilingInfo(eventList[4 + k + j * 4], CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
    clGetEventProfilingInfo(eventList[4 + k + j * 4], CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
    total_time = ((double)time_end - time_start)/1000000; //nanoseconds to milliseconds
    total_time_queued = ((double)time_end - time_start_queued)/1000000;
    final_time += total_time;
    final_time_queued += total_time_queued;

    cout<<"\n"<<(k + 1)<<" : "<<time_start<<" "<<time_end;
  }
}

The output of my profiling code is:

1 : 3989633359630 3989657015190
2 : 3989657016860 3989683273450
3 : 3989683275090 3989708840030
4 : 3989708841760 3989734915610
1 : 3989800219990 3989824648510
2 : 3989824650240 3989850888860
3 : 3989850890610 3989876392210
4 : 3989876393890 3989902432920
1 : 3989954275546 3989978865766
2 : 3989978867476 3990005037296
3 : 3990005038976 3990030592876
4 : 3990030594566 3990056566896
1 : 3990113144067 3990137315217
2 : 3990137316937 3990163458337
3 : 3990163460057 3990189007267
4 : 3990189008967 3990215129227
1 : 3990274589700 3990299102730
2 : 3990299104430 3990325570980
3 : 3990325572730 3990351050810
4 : 3990351052550 3990377255070
1 : 3990424871514 3990448828874
2 : 3990448830524 3990475309034
3 : 3990475310744 3990500849914
4 : 3990500851664 3990526839444
1 : 3990584574567 3990608802017
2 : 3990608803727 3990635102497
3 : 3990635104427 3990660647987
4 : 3990660649697 3990686716887
1 : 3990733269328 3990757174868
2 : 3990757176588 3990783429448
3 : 3990783431118 3990809003598
4 : 3990809005298 3990835207128

Recommended answer

I figured out why the kernels were not executing concurrently. For kernels to run concurrently, the GPU must have enough free resources to hold more than one of them at a time. In my code each kernel launched a large number of work-groups, so the GPU only had enough resources to run one kernel at a time. Keep the number of work-groups small to allow concurrent execution.
