Variable in OpenCL kernel 'for-loop' reduces performance

Question

I have a for-loop in my kernel that I had hard-coded to run a fixed number of iterations:

for (int kk = 0; kk < 50000; kk++)
{
  <... my code here ...>
}

I don't think the code inside the loop is relevant to my question; it is just some fairly simple table look-ups and integer math.

I wanted to make my kernel a little more flexible, so I replaced the hard-coded iteration count (50000) with a kernel input parameter, 'num_loops':

for (int kk = 0; kk < num_loops; kk++)
{
  <... more code here ...>
}

What I found is that even when my host program calls the kernel with

num_loops = 50000 

which is the same as the previously hard-coded value, the performance of my kernel is cut almost in half.

I'm trying to figure out what is causing the performance degradation. I imagine it has something to do with the OpenCL compiler not being able to unroll the loop efficiently?

Is there a way to do what I'm trying to do without incurring the performance penalty?

UPDATE: Here are some results from playing with "#pragma unroll".

Unfortunately, unrolling the loops doesn't seem to solve my performance problem.

Even unrolling the hard-coded loop degrades performance.

Here's the normal loop with the hard-coded value (best performance):

for (int kk = 0; kk < 50000; kk++)
// Time to execute = 0.18 (40180 Mi ops/sec)

If I unroll the loop, things get worse:

#pragma unroll
// or #pragma unroll 50000
for (int kk = 0; kk < 50000; kk++)
// Time to execute = 0.22 (33000 Mi ops/sec)

Here's the loop that uses a variable, with num_loops = 50000, first without a pragma and then with each pragma variant:

for (int kk = 0; kk < num_loops; kk++)
// Time to execute = 0.26 (27760 Mi ops/sec)

#pragma unroll 50000
for (int kk = 0; kk < num_loops; kk++)
// Time to execute = 0.26 (27760 Mi ops/sec)

#pragma unroll
for (int kk = 0; kk < num_loops; kk++)
// Time to execute = 0.24 (30280 Mi ops/sec)

Things do get a little better when the num_loops variable is combined with a plain "#pragma unroll", but even then throughput is still about 25% lower than the hard-coded loop without the pragma (30280 vs. 40180 Mi ops/sec).

Any other ideas on how to use num_loops as the loop bound without incurring a performance hit?

Recommended answer

Yes, the most likely cause of the performance degradation is that the compiler can't unroll the loop. There are a few things you could try to improve the situation.

You could define the parameter as a preprocessor macro passed via your program build options. This is a common trick for baking values that are only known at runtime into kernels as compile-time constants. For example:

clBuildProgram(program, 1, &device, "-Dnum_loops=50000", NULL, NULL);

You could construct the build options dynamically using sprintf to make this more flexible. Clearly this is only worth it if you don't need to change the parameter often, so that the overhead of recompilation doesn't become a problem.
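As a rough sketch of that dynamic variant, the options string could be assembled along these lines. The helper name make_build_options is made up for illustration, and snprintf is used here in place of sprintf only so that a too-small buffer is detected instead of overrun.

```c
#include <stdio.h>

/* Hypothetical helper (not part of the OpenCL API): writes
   "-Dnum_loops=<n>" into buf.
   Returns 0 on success, -1 if buf is too small or formatting failed. */
int make_build_options(char *buf, size_t size, int num_loops)
{
    int written = snprintf(buf, size, "-Dnum_loops=%d", num_loops);
    if (written < 0 || (size_t)written >= size)
        return -1;
    return 0;
}
```

The resulting string would then be passed as the fourth argument of clBuildProgram, in place of the literal shown above. Note that with this approach the kernel should no longer declare num_loops as an argument, because the macro would replace that identifier in the parameter list as well.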

You could investigate whether your OpenCL platform supports any pragmas that give the compiler hints about loop unrolling. For example, some OpenCL compilers recognise #pragma unroll (or similar). OpenCL 2.0 has an attribute for this: __attribute__((opencl_unroll_hint)).
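With an OpenCL 2.0 compiler, the attribute might be applied as in the sketch below, shown as kernel source held in a host-side string. The kernel name accumulate and its body are placeholders and not your code, the unroll factor of 4 is arbitrary, and the program is assumed to be built with -cl-std=CL2.0. It is only a hint, so the compiler is free to ignore it.

```c
/* OpenCL C kernel source kept as a host-side string, e.g. for
   clCreateProgramWithSource. The kernel name and loop body are
   hypothetical placeholders. */
static const char *kernel_source =
    "__kernel void accumulate(__global uint *out, int num_loops)\n"
    "{\n"
    "    uint acc = 0;\n"
    "    /* Ask the compiler to unroll by 4; the bound stays a runtime value. */\n"
    "    __attribute__((opencl_unroll_hint(4)))\n"
    "    for (int kk = 0; kk < num_loops; kk++)\n"
    "        acc = acc * 31u + (uint)kk;\n"
    "    out[get_global_id(0)] = acc;\n"
    "}\n";
```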

You could manually unroll the loop. How this would look depends on what assumptions you can make about the num_loops parameter. For example, if you know (or can ensure) that it will always be a multiple of 4, you could do something like this:

for (int kk = 0; kk < num_loops;)
{
  <... more code here ...>
  kk++;
  <... more code here ...>
  kk++;
  <... more code here ...>
  kk++;
  <... more code here ...>
  kk++;
}

Even if you can't make such assumptions, you should still be able to unroll manually, but it may require some extra work (for example, a follow-up loop to finish any remaining iterations).
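A minimal sketch of the general case, written as plain C so it can be checked on the host: the main loop handles four iterations per trip, and a second loop finishes the zero to three that are left. The step function is only a stand-in for your real loop body, and run_rolled is included purely as a reference to compare against.

```c
/* Stand-in for the real loop body (hypothetical): simple integer math. */
static unsigned step(unsigned acc, int kk)
{
    return acc * 31u + (unsigned)kk;
}

/* Reference version: one iteration per trip. */
unsigned run_rolled(int num_loops)
{
    unsigned acc = 0;
    for (int kk = 0; kk < num_loops; kk++)
        acc = step(acc, kk);
    return acc;
}

/* Manually unrolled by 4, valid for any num_loops. */
unsigned run_unrolled4(int num_loops)
{
    unsigned acc = 0;
    int kk = 0;

    /* Main part: runs while at least four iterations remain. */
    for (; num_loops - kk >= 4; kk += 4) {
        acc = step(acc, kk);
        acc = step(acc, kk + 1);
        acc = step(acc, kk + 2);
        acc = step(acc, kk + 3);
    }

    /* Remainder: at most three leftover iterations. */
    for (; kk < num_loops; kk++)
        acc = step(acc, kk);

    return acc;
}
```

Both functions perform the same iterations in the same order, so they return the same value for any num_loops. Whether the unrolled form is actually faster on your device is something you would need to measure.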
