Does GPGPU programming only allow the execution of SIMD instructions?


Question



Does GPGPU programming only allow the execution of SIMD instructions? If so, then it must be a tedious task to rewrite an algorithm that has been designed to run on a general CPU so that it runs on a GPU. Also, is there a pattern in algorithms that can be converted to SIMD architecture?

Solution

Well, it's not quite accurate that GPGPUs only support SIMD execution. Many GPUs have some non-SIMD components. But, overall, to take full advantage of a GPU you need to be running SIMD code.

However, you are NOT necessarily writing SIMD instructions. I.e. GPU SIMD is not the same as CPU SIMD - not the same as writing code to take advantage of x86 SSE (Streaming SIMD Extensions), etc. Indeed, as one of the people who brought CPU SIMD to you (I was heavily involved in Intel MMX, one of the earliest such instruction sets, and have followed the evolution to FP SIMD) I often feel obliged to correct people who say that CPUs like Intel's have SIMD instructions. I prefer to consider them packed vector instructions, although I grudgingly call them SIMD packed vector instruction sets just because everyone misuses the name. I also emphasize that CPU SIMD instruction sets such as MMX and SSE may have SIMD packed vector execution units - integer and floating point ALUs, etc. - but they don't have SIMD control flow, and they usually don't have SIMD memory access (aka scatter/gather), although Intel Larrabee was moving in that direction.

Some pages on my comp-arch.net wiki cover this (I write about computer architecture as a hobby):

- http://semipublic.comp-arch.net/wiki/SIMD
- http://semipublic.comp-arch.net/wiki/SIMD_packed_vector
- http://semipublic.comp-arch.net/wiki/Difference_between_vector_and_packed_vector
- http://semipublic.comp-arch.net/wiki/Single_Instruction_Multiple_Threads_(SIMT)

although I apologize for not yet having written the page that talks about SIMD packed vector instruction sets, as in Intel MMX or SSE.

But I don't expect you to read all of the above. Let me try to explain.

Imagine that you have a piece of code that looks something like this, when written in a simple, scalar, manner:

// operating on an array with one million 32b floating point elements A[1000000]
for i from 0 upto 999999 do
     if some_condition(A[i]) then
           A[i] = function1(A[i])
     else
           A[i] = function2(A[i])

where function1() and function2() are simple enough to inline - say function1(x) = x*x and function2(x) = sqrt(x).

On a CPU, to use something like SSE, you would have to (1) divide the array up into chunks, say the size of a 256-bit AVX register, and (2) handle the IF statement yourself, using masks or the like. Something like:

for i from 0 upto 999999 by 8 do
     register tmp256b_1 = load256b(&A[i])
     register tmp256b_2 = tmp256b_1 * tmp256b_1
     register tmp256b_3 = _mm_sqrt_ps(tmp256b_1) // this is an "intrinsic"
                                                 // a function, possibly inlined,
                                                 // doing a Newton-Raphson to evaluate sqrt.
     register mask256b = ... code that arranges for you to have 32 1s in the "lane" 
                         where some_condition is true, and 0s elsewhere...
     register tmp256b_4 = (tmp256b_2 & mask256b) | (tmp256b_3 & ~mask256b)
     store256b(&A[i],tmp256b_4)

You may not think this is so bad, but remember, this is a simple example. Imagine multiple nested IFs, and so on. Or, imagine that "some_condition" is clumpy, so that you might save a lot of unnecessary computation by skipping sections where it is all function1 or all function2...

for i from 0 upto 999999 by 8 do
     register mask256b = ... code that arranges for you to have 32 1s in the "lane" 
                         where some_condition is true, and 0s elsewhere...
     register tmp256b_1 = load256b(&A[i])
     if mask256b == ~0 then
         register tmp256b_2 = tmp256b_1 * tmp256b_1
         store256b(&A[i],tmp256b_2)
     else if mask256b == 0 then
         register tmp256b_3 = _mm_sqrt_ps(tmp256b_1) // this is an "intrinsic"
         store256b(&A[i],tmp256b_3)
     else
         register tmp256b_2 = tmp256b_1 * tmp256b_1
         register tmp256b_3 = _mm_sqrt_ps(tmp256b_1)
         register tmp256b_4 = (tmp256b_2 & mask256b) | (tmp256b_3 & ~mask256b)
         store256b(&A[i],tmp256b_4)

I think you get the picture? And it gets even more complicated when you have multiple arrays: sometimes the data is aligned on a 256-bit boundary, and sometimes not (as is typical, say, in stencil computations, where you operate on all alignments).

Now, here's roughly what it looks like on something like a GPU:

// operating on an array with one million 32b floating point elements A[1000000]
for all i from 0 upto 999999 do
     if some_condition(A) then
           A = function1(A)
     else
           A = function2(A)

Doesn't that look a lot more like the original scalar code? The only real difference is that you have lost the array indexes, A[i]. (Actually, some GPGPU languages keep the array indexes in, but most that I know of do not.)

Now, I have left out (a) OpenCL's C-like syntax, and (b) all of the setup that you need to connect the OpenCL code to your C or C++ code. (There are much better languages than CUDA or OpenCL - these have a lot of cruft. But they are available in many places, on both CPUs and GPUs[**].) But I think I have presented the heart of the matter:

The key thing about GPGPU computation is that you write SIMD, data parallel code. But you write it at a higher level than you write CPU-style SSE code. Higher level even than the compiler intrinsics.

First, the GPGPU compiler, e.g. the OpenCL or CUDA compiler, handles a lot of the data management behind your back. The compiler arranges to do the control flow, the IF statements, etc.

By the way, note, as I marked with [**], that sometimes a so-called SIMD GPGPU compiler can generate code that will run on both CPUs and GPUs. I.e. a SIMD compiler can generate code that uses CPU SIMD instruction sets.

But GPUs themselves have special hardware support that runs this SIMD code, appropriately compiled, much faster than it can run on the CPU using CPU SIMD instructions. Most importantly, the GPUs have many more execution units - e.g. a CPU like AMD Bulldozer has 2 sets of 128-bit-wide FMACs, i.e. is capable of doing 8 FMACs per cycle. Multiply by the number of CPU cores on a chip - say 8 - and you get maybe 64 per cycle. Whereas a modern GPU may do 2,048 32b FMACs every cycle. Even if running at 1/2 or 1/4 the clock rate, that's a big difference.
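The back-of-envelope arithmetic above can be written out; the 1/2 clock ratio is an illustrative assumption from the text, not a measured number:

```c
/* Peak-FMAC comparison from the text: 8 cores x 8 FMACs/cycle on the CPU
   vs. 2,048 FMACs/cycle on the GPU at an assumed half clock rate. */
double fmac_throughput_ratio(void) {
    const double cpu_per_cycle = 8.0 * 8.0;  /* 8 cores x 8 FMACs/cycle = 64 */
    const double gpu_per_cycle = 2048.0;     /* 32b FMACs per cycle */
    const double gpu_clock_ratio = 0.5;      /* GPU clocked at half the CPU rate */
    return (gpu_per_cycle * gpu_clock_ratio) / cpu_per_cycle;
}
```

Even under that pessimistic clock assumption the ratio comes out to 16x in the GPU's favor.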

How can the GPUs have so much more hardware? Well, first, they are usually bigger chips than the CPU. But, also, they tend not to spend (some say "waste") hardware on things like big caches and out-of-order execution that CPUs spend it on. CPUs try to make one or a few computations fast, whereas GPUs do many computations in parallel, but individually slower than the CPU. Still, the total number of computations that the GPU can do per second is much higher than a CPU can do.

GPUs have other hardware optimizations. For example, they run many more threads than a CPU. Whereas an Intel CPU has 2 hyperthreads per core, giving you 16 threads on an 8-core chip, a GPU may have hundreds. And so on.

Most interesting to me as a computer architect, many GPUs have special hardware support for SIMD control flow. They make manipulating those masks much more efficient than on a CPU running SSE.

And so on.


Anyway, I hope that I have made my point:

  • You do have to write SIMD code to run on a GPGPU system (like OpenCL).

  • You should not confuse this sort of SIMD with the SIMD code you have to write to take advantage of Intel SSE.

It's much cleaner.

More and more compilers are allowing the same code to run on both CPU and GPU. I.e. they are increasingly supporting the clean "real SIMD" coding style, rather than the fake "pseudo-SIMD" coding style that has been necessary to take advantage of MMX and SSE and AVX until now. This is good - such code is equally "nice" to program on both CPU and GPU. But the GPU often runs it much faster. There's a paper by Intel called "Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU", http://www.hwsw.hu/kepek/hirek/2010/06/p451-lee.pdf. It says GPUs are "only" 2.5X faster on average. But that's after a lot of aggressive optimization. The GPU code is often easier to write. And I don't know about you, but I think "only" 2.5X faster is nothing to sneeze at. Especially since the GPGPU code is often easier to read.

Now, there's no free lunch. If your code is naturally data parallel, great. But some code is not. That can be a pain.

And, like all machines, GPUs have their quirks.

But if your code is naturally data parallel, you may get great speedups, with code that is much more readable.

I'm a CPU designer. I expect to borrow lots of ideas from GPUs to make CPUs run faster, and vice versa.
