SSE2: How To Load Data From Non-Contiguous Memory Locations?


Question

I'm trying to vectorize some extremely performance critical code. At a high level, each loop iteration reads six floats from non-contiguous positions in a small array, then converts these values to double precision and adds them to six different double precision accumulators. These accumulators are the same across iterations, so they can live in registers. Due to the nature of the algorithm, it's not feasible to make the memory access pattern contiguous. The array is small enough to fit in L1 cache, though, so memory latency/bandwidth isn't a bottleneck.

I'm willing to use assembly language or SSE2 intrinsics to parallelize this. I know I need to load two floats at a time into the two lower dwords of an XMM register, convert them to two doubles using cvtps2pd, then add them to two accumulators at a time using addpd.

My question is, how do I get the two floats into the two lower dwords of a single XMM register if they aren't adjacent to each other in memory? Obviously any technique that's so slow that it defeats the purpose of parallelization isn't useful. An answer in either ASM or Intel/GCC intrinsics would be appreciated.

EDIT:

  1. The size of the float array is, strictly speaking, not known at compile time but it's almost always 256, so this can be special cased.

  2. The element of the float array that should be read is determined by loading a value from a byte array. There are six byte arrays, one for each accumulator. The reads from the byte array are sequential, one from each array for each loop iteration, so there shouldn't be many cache misses there.

  3. The access pattern of the float array is for all practical purposes random.
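For reference, the loop described above can be sketched in scalar C roughly as follows. This is only an illustration of the access pattern, not the original code: the function name `accumulate6_scalar`, the parameter names, and the layout of the six byte arrays as an array of six pointers are all assumptions.

```c
#include <stddef.h>

/* Scalar reference sketch (hypothetical names throughout).
 * table : the small float array (almost always 256 entries)
 * idx   : six byte arrays, one per accumulator; idx[j][k] selects
 *         the table element that accumulator j reads at iteration k
 * n     : number of loop iterations
 * acc   : receives the six double-precision sums */
void accumulate6_scalar(const float *table,
                        const unsigned char *const *idx,
                        size_t n, double acc[6])
{
    for (int j = 0; j < 6; ++j)
        acc[j] = 0.0;

    for (size_t k = 0; k < n; ++k) {
        for (int j = 0; j < 6; ++j) {
            /* Byte arrays are read sequentially; the float array is
             * read at an effectively random position. */
            acc[j] += (double)table[idx[j][k]];
        }
    }
}
```

Because the indices are bytes, they can never exceed 255, which is why the 256-entry case is the natural one to special-case.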

Solution

For this specific case, take a look at the unpack-and-interleave instructions in your instruction reference manual. It would be something like

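movss xmm0, <addr1>      ; xmm0 = [a, 0, 0, 0] (a load from memory zeroes the upper lanes)
movss xmm1, <addr2>      ; xmm1 = [b, 0, 0, 0]
unpcklps xmm0, xmm1      ; xmm0 = [a, b, 0, 0], ready for cvtps2pd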

Also take a look at shufps, which is handy whenever you have the data you want in the wrong order.
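With intrinsics, the same movss / unpcklps sequence, followed by the cvtps2pd and addpd steps from the question, might look like the sketch below. The helper `gather2_pd`, the function `accumulate6_sse2`, and the way the six byte arrays are passed in are assumptions made for illustration; the intrinsics themselves (`_mm_load_ss`, `_mm_unpacklo_ps`, `_mm_cvtps_pd`, `_mm_add_pd`) are the standard ones from `<emmintrin.h>`. Whether this actually beats a scalar loop would need to be measured.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */
#include <stddef.h>

/* Hypothetical helper: fetch table[i0] and table[i1] from unrelated
 * addresses and return them as two doubles in one XMM register. */
static inline __m128d gather2_pd(const float *table,
                                 unsigned char i0, unsigned char i1)
{
    __m128 a  = _mm_load_ss(&table[i0]);  /* movss:    [a, 0, 0, 0] */
    __m128 b  = _mm_load_ss(&table[i1]);  /* movss:    [b, 0, 0, 0] */
    __m128 ab = _mm_unpacklo_ps(a, b);    /* unpcklps: [a, b, 0, 0] */
    return _mm_cvtps_pd(ab);              /* cvtps2pd: low two floats -> doubles */
}

/* Six accumulators kept as three pairs of doubles (assumed layout:
 * idx[j][k] is the byte index for accumulator j at iteration k). */
void accumulate6_sse2(const float *table,
                      const unsigned char *const *idx,
                      size_t n, double out[6])
{
    __m128d acc01 = _mm_setzero_pd();
    __m128d acc23 = _mm_setzero_pd();
    __m128d acc45 = _mm_setzero_pd();

    for (size_t k = 0; k < n; ++k) {
        /* addpd: one add updates two accumulators at a time */
        acc01 = _mm_add_pd(acc01, gather2_pd(table, idx[0][k], idx[1][k]));
        acc23 = _mm_add_pd(acc23, gather2_pd(table, idx[2][k], idx[3][k]));
        acc45 = _mm_add_pd(acc45, gather2_pd(table, idx[4][k], idx[5][k]));
    }

    _mm_storeu_pd(out + 0, acc01);
    _mm_storeu_pd(out + 2, acc23);
    _mm_storeu_pd(out + 4, acc45);
}
```

The three accumulator pairs are independent of each other, so the additions for different pairs do not have to wait on one another within an iteration.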
