我的opencl测试没有比CPU快得多 [英] My opencl test does not run much faster than CPU

查看:46
本文介绍了我的opencl测试没有比CPU快得多的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试测量GPU的执行时间,并将其与CPU进行比较. 我编写了一个simple_add函数来添加短整数向量的所有元素. 内核代码为:

I am trying to measure the execution time of GPU and compare it with CPU. I wrote a simple_add function to add all elements of a short int vector. The Kernel code is:

global const int * A, global const uint * B, global int* C)
    {
        ///------------------------------------------------
        /// Add 16 bits of each
        int AA=A[get_global_id(0)];
        int BB=B[get_global_id(0)];
        int AH=0xFFFF0000 & AA;
        int AL=0x0000FFFF & AA;
        int BH=0xFFFF0000 & BB;
        int BL=0x0000FFFF & BB;
        int CL=(AL+BL)&0x0000FFFF;
        int CH=(AH+BH)&0xFFFF0000;      
        C[get_global_id(0)]=CH|CL;               
     }

我为此功能编写了另一个CPU版本,并在执行了100次时间后测量了它们的执行时间

I wrote another CPU version for this function and after 100 time executions measured their execution time

clock_t before_GPU = clock();
for(int i=0;i<100;i++)
{
  queue.enqueueNDRangeKernel(kernel_add,1,
  cl::NDRange((size_t)(NumberOfAllElements/4)),cl::NDRange(64));
  queue.finish();
 }
 clock_t after_GPU = clock();


 clock_t before_CPU = clock();
 for(int i=0;i<100;i++)
     AddImagesCPU(A,B,C);
  clock_t after_CPU = clock();

调用整个测量函数10次后结果如下:

the result was as below after 10 times calling the whole measurement function:

        CPU time: 1359
        GPU time: 1372
        ----------------
        CPU time: 1336
        GPU time: 1269
        ----------------
        CPU time: 1436
        GPU time: 1255
        ----------------
        CPU time: 1304
        GPU time: 1266
        ----------------
        CPU time: 1305
        GPU time: 1252
        ----------------
        CPU time: 1313
        GPU time: 1255
        ----------------
        CPU time: 1313
        GPU time: 1253
        ----------------
        CPU time: 1384
        GPU time: 1254
        ----------------
        CPU time: 1300
        GPU time: 1254
        ----------------
        CPU time: 1322
        GPU time: 1254
        ----------------

问题是我确实希望GPU比CPU快得多,但事实并非如此.我不明白为什么我的GPU速度没有比CPU高很多.我的代码有什么问题吗? 这是我的GPU属性:

The problem is that I really expected GPU to be much faster than CPU but it was not. I can't understand why my GPU speed is not much higher than CPU. Is there any problem in my codes ?? Here is my GPU properties:

        -----------------------------------------------------
        ------------- Selected Platform Properties-------------:
        NAME:   AMD Accelerated Parallel Processing
        EXTENSION:      cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
        VENDOR:         Advanced Micro Devices, Inc.
        VERSION:        OpenCL 1.2 AMD-APP (937.2)
        PROFILE:        FULL_PROFILE
        -----------------------------------------------------
        ------------- Selected Device Properties-------------:
        NAME :  ATI RV730
        TYPE :  4
        VENDOR :        Advanced Micro Devices, Inc.
        PROFILE :       FULL_PROFILE
        VERSION :       OpenCL 1.0 AMD-APP (937.2)
        EXTENSIONS :    cl_khr_gl_sharing cl_amd_device_attribute_query cl_khr_d3d10_sharing
        MAX_COMPUTE_UNITS :     8
        MAX_WORK_GROUP_SIZE :   128
        OPENCL_C_VERSION :      OpenCL C 1.0
        DRIVER_VERSION:         CAL 1.4.1734
        ==========================================================

为了比较这是我的CPU规格:

and just to compare this is my CPU specifications:

        ------------- CPU Properties-------------:
        NAME :          Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz
        TYPE :  2
        VENDOR :        GenuineIntel
        PROFILE :       FULL_PROFILE
        VERSION :       OpenCL 1.2 AMD-APP (937.2)
        MAX_COMPUTE_UNITS :     4
        MAX_WORK_GROUP_SIZE :   1024
        OPENCL_C_VERSION :      OpenCL C 1.2
        DRIVER_VERSION:         2.0 (sse2,avx)
        ==========================================================

我还使用QueryPerformanceCounter测量了挂钟时间,这是结果:

I also measured the wall clock time using QueryPerformanceCounter and here is the results:

            CPU time: 1304449.6  micro-sec
            GPU time: 1401740.82  micro-sec
            ----------------------
            CPU time: 1620076.55  micro-sec
            GPU time: 1310317.64  micro-sec
            ----------------------
            CPU time: 1468520.44  micro-sec
            GPU time: 1317153.63  micro-sec
            ----------------------
            CPU time: 1304367.29  micro-sec
            GPU time: 1251865.14  micro-sec
            ----------------------
            CPU time: 1301589.17  micro-sec
            GPU time: 1252889.4  micro-sec
            ----------------------
            CPU time: 1294750.21  micro-sec
            GPU time: 1257017.41  micro-sec
            ----------------------
            CPU time: 1297506.93  micro-sec
            GPU time: 1252768.9  micro-sec
            ----------------------
            CPU time: 1293511.29  micro-sec
            GPU time: 1252019.88  micro-sec
            ----------------------
            CPU time: 1320753.54  micro-sec
            GPU time: 1248895.73  micro-sec
            ----------------------
            CPU time: 1296486.95  micro-sec
            GPU time: 1255207.91  micro-sec
            ----------------------

同样,我尝试使用opencl分析进行执行.

Again I tried the opencl profiling for execution time.

            queue.enqueueNDRangeKernel(kernel_add,1,
                                    cl::NDRange((size_t)(NumberOfAllElements/4)),
                                    cl::NDRange(64),NULL,&ev);
            ev.wait();
            queue.finish();
            time_start=ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();
            time_end=ev.getProfilingInfo<CL_PROFILING_COMMAND_END>();

一次执行的结果大致相同:

Results for one time execution were more or less the same:

            CPU time: 13335.1815  micro-sec
            GPU time: 11865.111  micro-sec
            ----------------------
            CPU time: 13884.0235  micro-sec
            GPU time: 11663.889  micro-sec
            ----------------------
            CPU time: 19724.7296  micro-sec
            GPU time: 14548.222  micro-sec
            ----------------------
            CPU time: 19945.3199  micro-sec
            GPU time: 15331.111  micro-sec
            ----------------------
            CPU time: 17973.5055  micro-sec
            GPU time: 11641.444  micro-sec
            ----------------------
            CPU time: 12652.6683  micro-sec
            GPU time: 11632  micro-sec
            ----------------------
            CPU time: 18875.292  micro-sec
            GPU time: 14783.111  micro-sec
            ----------------------
            CPU time: 32782.033  micro-sec
            GPU time: 11650.444  micro-sec
            ----------------------
            CPU time: 20462.2257  micro-sec
            GPU time: 11647.778  micro-sec
            ----------------------
            CPU time: 14529.6618  micro-sec
            GPU time: 11860.112  micro-sec

推荐答案

ATI RV730具有VLIW结构,因此最好尝试使用总线程数为1/4的uint4int4向量类型(即NumberOfAllElements/16).这也有助于更快地从内存中加载每个工作项.

ATI RV730 has VLIW structure so it is better to try uint4 and int4 vector types with 1/4 number of total threads (which is NumberOfAllElements/16). This would also help loading from memory faster for each work item.

与内存操作相比,内核的计算量也不大.使缓冲区映射到RAM将具有更好的性能.不要复制数组,请使用map/unmap enqueue命令将它们映射到内存.

Also kernel doesn't have much calculations compared to memory operations. Making buffers mapped to RAM would have better performance. Don't copy arrays, map them to memory using map/unmap enqueue commands.

如果仍不能更快,您可以同时使用gpu和cpu进行上半部分和下半部分的工作,以在%50的时间内完成它.

If its still not faster, you can use both gpu and cpu at the same time to work on first half and second half of work to finish it in %50 time.

也不要将clFinish放入循环中.将其放在循环结束之后.这样,它将更快地入队,并且已经按顺序执行,因此在完成第一个项目之前不会启动其他任务.我想这是有序队列,在每个入队后添加clfinish是额外的开销.仅在最新内核之后执行一次简单操作即可.

Also don't put clFinish in loop. Put it just after the end of loop. This way it will enqueue it much faster and it already has in-order execution so it won't start others before finishing the first item. It is in-order queue I suppose and adding clfinish after each enqueue is extra overhead. Only a single clfinish after latest kernel is enough.

ATI RV730:64个VLIW单元,每个单元至少具有4个流核心. 750 MHz.

ATI RV730: 64 VLIW units, each has at least 4 streaming cores. 750 MHz.

i3-2100:2个内核(仅用于防起泡的线程)每个均具有能够同时流式传输8个操作的AVX.因此,这可能有16项操作正在进行中.超过3 GHz.

i3-2100: 2 cores(threads just for anti-bubbling) each having AVX that capable of streaming 8 operations simultaneously. So this can have 16 operations in flight. More than 3 GHz.

简单地将流操作与频率相乘:

Simply multiplication of streaming operations with frequencies:

ATI RV730 = 192个单位(乘-加功能更多,以每个vliw的第5个元素为准)

ATI RV730 = 192 units (more with multiply-add functions, by 5th element of each vliw)

i3-2100 = 48个单位

i3-2100 = 48 units

因此gpu的速度至少应为4倍(使用int4,uint4).这用于简单的ALU和FPU操作,例如按位运算和乘法. trancandentals性能等特殊功能可能会有所不同,因为它们仅在每个vliw中的第5个单元上运行.

so gpu should be at least 4x as fast(use int4, uint4). This is for simple ALU and FPU operations such as bitwise operations and multiplications. Special functions such as trancandentals performance could be different since they run only on 5th unit in each vliw.

这篇关于我的opencl测试没有比CPU快得多的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆