SSE Loading & Adding

Question

Assume I have two vectors represented by two arrays of type double, each of size 2. I'd like to add corresponding positions. So, given vectors i0 and i1, I'd like to compute i0[0] + i1[0] and i0[1] + i1[1].

Since the type is double, I would need two registers. The trick would be to put i0[0] and i1[0] in one register, and i0[1] and i1[1] in another, and just add the register with itself.

My question is, if I call _mm_load_ps(i0[0]) and then _mm_load_ps(i1[0]), will that place them in the lower and upper 64-bits separately, or will it replace the register with the second load? How would I place both doubles in the same register, so I can call add_ps after?

Thanks.

Recommended answer

I think what you want is this:

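#include <emmintrin.h>  // SSE2 intrinsics (__m128d, _mm_load_pd, _mm_add_pd)

double i0[2];
double i1[2];

// _mm_load_pd requires 16-byte-aligned addresses; use _mm_loadu_pd otherwise
__m128d x1 = _mm_load_pd(i0);
__m128d x2 = _mm_load_pd(i1);
__m128d sum = _mm_add_pd(x1, x2);
// do whatever you want to with "sum" now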

When you do a _mm_load_pd, it puts the first double into the lower 64 bits of the register and the second into the upper 64 bits. So, after the loads above, x1 holds the two double values i0[0] and i0[1] (and similar for x2). The call to _mm_add_pd vertically adds the corresponding elements in x1 and x2, so after the addition, sum holds i0[0] + i1[0] in its lower 64 bits and i0[1] + i1[1] in its upper 64 bits.
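The load, add, and store sequence described above can be wrapped in a small function. This is only a sketch: add_pairs is a hypothetical helper name, and the code assumes that all three pointers are 16-byte aligned, which the aligned load and store intrinsics require (the asserts check that assumption in debug builds).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: _mm_load_pd, _mm_add_pd, _mm_store_pd */
#include <stdint.h>

/* add_pairs is a hypothetical helper, not part of any library.
   Assumption: a, b and out each point to two doubles and are
   16-byte aligned. */
static void add_pairs(const double *a, const double *b, double *out)
{
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    __m128d x1 = _mm_load_pd(a);       /* low 64 bits = a[0], high 64 bits = a[1] */
    __m128d x2 = _mm_load_pd(b);       /* low 64 bits = b[0], high 64 bits = b[1] */
    __m128d sum = _mm_add_pd(x1, x2);  /* lane-wise: a[0]+b[0] and a[1]+b[1] */
    _mm_store_pd(out, sum);            /* out[0] = low lane, out[1] = high lane */
}
```

Called with 16-byte-aligned arrays holding {1.0, 2.0} and {10.0, 20.0}, the function writes 11.0 to out[0] and 22.0 to out[1]. If alignment cannot be guaranteed, _mm_loadu_pd and _mm_storeu_pd are the unaligned counterparts.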

I should point out that there is no benefit to using _mm_load_pd instead of _mm_load_ps. As the function names indicate, the pd variety explicitly loads two packed doubles and the ps version loads four packed single-precision floats. Since these are purely bit-for-bit memory moves and they both use the SSE floating-point unit, there is no penalty to using _mm_load_ps to load in double data. And, there is a benefit to _mm_load_ps: its instruction encoding is one byte shorter than _mm_load_pd, so it is more efficient from an instruction cache sense (and potentially instruction decoding; I'm not an expert on all of the intricacies of modern x86 processors). The above code using _mm_load_ps would look like:

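#include <emmintrin.h>  // SSE2 intrinsics

double i0[2];
double i1[2];

// _mm_castps_pd only reinterprets the register; a plain (__m128d) cast is not portable across compilers
__m128d x1 = _mm_castps_pd(_mm_load_ps((const float *) i0));
__m128d x2 = _mm_castps_pd(_mm_load_ps((const float *) i1));
__m128d sum = _mm_add_pd(x1, x2);
// do whatever you want to with "sum" now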

The casts perform no operation at run time; they simply make the compiler reinterpret the SSE register's contents as holding doubles instead of floats, so that the register can be passed to the double-precision arithmetic function _mm_add_pd.
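To illustrate that loading through _mm_load_ps leaves the double values bit-for-bit intact, here is a minimal sketch. add_pairs_via_ps is a hypothetical helper name, the pointers are assumed to be 16-byte aligned, and the reinterpretation uses the _mm_castps_pd intrinsic, which changes only the register's type.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE _mm_load_ps */
#include <stdint.h>

/* add_pairs_via_ps is a hypothetical helper, not part of any library.
   Assumption: a, b and out each point to two doubles and are
   16-byte aligned. */
static void add_pairs_via_ps(const double *a, const double *b, double *out)
{
    assert((uintptr_t)a % 16 == 0);
    assert((uintptr_t)b % 16 == 0);
    assert((uintptr_t)out % 16 == 0);

    /* Load 128 bits as four floats, then reinterpret them as two doubles.
       The cast changes only the type the compiler sees, not the bits. */
    __m128d x1 = _mm_castps_pd(_mm_load_ps((const float *)a));
    __m128d x2 = _mm_castps_pd(_mm_load_ps((const float *)b));

    _mm_store_pd(out, _mm_add_pd(x1, x2));  /* double-precision lane-wise add */
}
```

For the same inputs, this should produce the same sums as the _mm_load_pd version, since both loads copy the same 128 bits from memory.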
