Vectorization of a loop inside parallel_for_each (p_f_e)


Question

Let's say that I want 10 threads to execute the same loop inside the kernel body.

Normally this loop is vectorized when the CPU performs the operation.
Is the same loop inside the kernel body also vectorized when it is passed to the GPU?

parallel_for_each(ext, [=] (index<1> idx) restrict(amp) 
{
     for (int i = 0; i < 1000; i++) 
     {
         A[i] = B[i] + C[i];
     }
});


Recommended answer

The C++ AMP compiler targets HLSL bytecode and does attempt vectorization. Also, if the compiler does not vectorize, the GPU driver is likely to do so when lowering the bytecode to native hardware instructions, provided that makes sense for the target GPU architecture.

Having said that, most modern GPUs are scalar architectures, so vectorized short-vector code brings no performance benefit on them.

-Grant
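To illustrate the point about scalar architectures: on such hardware, parallelism typically comes from running many threads at once rather than from vector instructions inside a single thread. Note also that the kernel in the question never uses idx, so every thread would repeat the whole loop over the same elements. A common alternative is to let each thread handle one element. The sketch below is plain standard C++, not C++ AMP: for_each_index is a hypothetical, sequential stand-in for parallel_for_each, and add_per_thread and add_loop are illustrative names that do not come from the original post. Both functions assume that b and c have the same length.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for parallel_for_each: calls kernel(idx) once per
// index. On a GPU these calls would run as concurrent threads; here they run
// one after another so the sketch builds with any standard C++ compiler.
template <typename Kernel>
void for_each_index(std::size_t extent, Kernel kernel) {
    for (std::size_t idx = 0; idx < extent; ++idx) {
        kernel(idx);
    }
}

// Per-thread formulation: each "thread" computes exactly one element.
// Assumes b.size() == c.size().
std::vector<int> add_per_thread(const std::vector<int>& b,
                                const std::vector<int>& c) {
    std::vector<int> a(b.size());
    for_each_index(a.size(), [&](std::size_t idx) {
        a[idx] = b[idx] + c[idx];  // a single scalar add per thread
    });
    return a;
}

// The loop from the question as ordinary CPU code. A CPU compiler may
// auto-vectorize a loop of this shape. Assumes b.size() == c.size().
std::vector<int> add_loop(const std::vector<int>& b,
                          const std::vector<int>& c) {
    std::vector<int> a(b.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] = b[i] + c[i];
    }
    return a;
}
```

Both functions produce the same result; the difference is where the parallelism is expressed. In the per-thread form the body contains no loop to vectorize, which fits the observation above that short-vector code gains nothing on a scalar GPU.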

