Implications of using _mm_shuffle_ps on an integer vector

Question

The SSE intrinsics include _mm_shuffle_ps(xmm1, xmm2, imm8), which picks two elements from xmm1 and concatenates them with two elements picked from xmm2. It is intended for floats (as the _ps suffix, "packed single", implies). However, if you cast your packed integers (__m128i), you can use _mm_shuffle_ps on them as well:

#include <iostream>
#include <immintrin.h>
#include <sstream>

using namespace std;

template <typename T>
std::string __m128i_toString(const __m128i var) {
    std::stringstream sstr;
    const T* values = (const T*) &var;
    if (sizeof(T) == 1) {
        for (unsigned int i = 0; i < sizeof(__m128i); i++) {
            sstr << (int) values[i] << " ";
        }
    } else {
        for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) {
            sstr << values[i] << " ";
        }
    }
    return sstr.str();
}



int main(){

  cout << "Starting SSE test" << endl;
  cout << "integer shuffle" << endl;

 int A[] = {1,  -2147483648, 3, 5};
 int B[] = {4, 6, 7, 8};

  __m128i pC;

  __m128i* pA = (__m128i*) A;
  __m128i* pB = (__m128i*) B;

  *pA = (__m128i)_mm_shuffle_ps((__m128)*pA, (__m128)*pB, _MM_SHUFFLE(3, 2, 1 ,0));
  pC = _mm_add_epi32(*pA,*pB);

  cout << "A[0] = " << A[0] << endl;
  cout << "A[1] = " << A[1] << endl;
  cout << "A[2] = " << A[2] << endl;
  cout << "A[3] = " << A[3] << endl;

  cout << "B[0] = " << B[0] << endl;
  cout << "B[1] = " << B[1] << endl;
  cout << "B[2] = " << B[2] << endl;
  cout << "B[3] = " << B[3] << endl;

  cout << "pA = " << __m128i_toString<int>(*pA) << endl;
  cout << "pC = " << __m128i_toString<int>(pC) << endl;
}

Snippet of the relevant assembly (Mac OS X, MacPorts GCC 4.8, -march=native on an Ivy Bridge CPU):

vshufps $228, 16(%rsp), %xmm1, %xmm0
vpaddd  16(%rsp), %xmm0, %xmm2
vmovdqa %xmm0, 32(%rsp)
vmovaps %xmm0, (%rsp)
vmovdqa %xmm2, 16(%rsp)
call    __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
....

So it seems to work fine on integers, which I expected, since the registers are type-agnostic. Still, there must be a reason why the documentation says this instruction is only for floats. Does anyone know of any downsides or implications I have missed?

Recommended answer

There is no integer equivalent of _mm_shuffle_ps. To achieve the same effect in this case, you can use:

SSE2

*pA = _mm_shuffle_epi32(_mm_unpacklo_epi32(*pA, _mm_shuffle_epi32(*pB, 0xe)),0xd8);

SSE4.1

*pA = _mm_blend_epi16(*pA, *pB, 0xf0);

or change to the floating-point domain, for example:

*pA = _mm_castps_si128( 
        _mm_shuffle_ps(_mm_castsi128_ps(*pA), 
                       _mm_castsi128_ps(*pB), _MM_SHUFFLE(3, 2, 1 ,0)));
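
The three variants above can be checked against each other. The following is a minimal sketch, not part of the original answer: the `Lanes` type and the `combine_*` and `variants_agree` names are hypothetical helpers introduced here for illustration, and the SSE4.1 variant is compiled only when the compiler defines `__SSE4_1__` (for example with -msse4.1 or -march=native on GCC/Clang). Each function is expected to return {a[0], a[1], b[2], b[3]}.

```cpp
#include <array>
#include <cstdint>
#include <emmintrin.h>   // SSE2 integer intrinsics; also pulls in _mm_shuffle_ps
#ifdef __SSE4_1__
#include <smmintrin.h>   // SSE4.1: _mm_blend_epi16
#endif

// Hypothetical helper type for this sketch: four 32-bit integer lanes.
using Lanes = std::array<std::int32_t, 4>;

// Unaligned load/store, so the arrays need no special alignment.
static inline __m128i load_lanes(const Lanes& v) {
    return _mm_loadu_si128(reinterpret_cast<const __m128i*>(v.data()));
}
static inline Lanes store_lanes(__m128i v) {
    Lanes out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), v);
    return out;
}

// Float-domain shuffle: the casts only reinterpret bits, no conversion happens.
inline Lanes combine_shufps(const Lanes& a, const Lanes& b) {
    const __m128 r = _mm_shuffle_ps(_mm_castsi128_ps(load_lanes(a)),
                                    _mm_castsi128_ps(load_lanes(b)),
                                    _MM_SHUFFLE(3, 2, 1, 0));
    return store_lanes(_mm_castps_si128(r));
}

// SSE2 integer-only variant from the answer.
inline Lanes combine_sse2(const Lanes& a, const Lanes& b) {
    const __m128i hi_b = _mm_shuffle_epi32(load_lanes(b), 0xe);     // {b2, b3, b0, b0}
    const __m128i mix  = _mm_unpacklo_epi32(load_lanes(a), hi_b);   // {a0, b2, a1, b3}
    return store_lanes(_mm_shuffle_epi32(mix, 0xd8));               // {a0, a1, b2, b3}
}

#ifdef __SSE4_1__
// SSE4.1 variant: mask 0xf0 takes the upper four 16-bit words from b.
inline Lanes combine_blend(const Lanes& a, const Lanes& b) {
    return store_lanes(_mm_blend_epi16(load_lanes(a), load_lanes(b), 0xf0));
}
#endif

// True when every variant that was compiled in yields {a0, a1, b2, b3}.
inline bool variants_agree(const Lanes& a, const Lanes& b) {
    const Lanes expect{a[0], a[1], b[2], b[3]};
    bool ok = combine_shufps(a, b) == expect && combine_sse2(a, b) == expect;
#ifdef __SSE4_1__
    ok = ok && combine_blend(a, b) == expect;
#endif
    return ok;
}
```

Calling `variants_agree` with the question's inputs, {1, -2147483648, 3, 5} and {4, 6, 7, 8}, is one way to confirm that routing integer data through the float-domain shuffle leaves the bit patterns unchanged.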


But changing domains may incur bypass-latency delays on some CPUs. Keep in mind that, according to Agner Fog:

The bypass delay is important in long dependency chains where latency is a bottleneck, but not where it is throughput rather than latency that matters.

You have to test your code and see which method above is more efficient.
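
One way to run such a test is a latency-bound micro-benchmark: a loop in which every iteration depends on the previous one, so any bypass delay between the shuffle and the integer add shows up in the time per iteration. The sketch below is only an illustration under stated assumptions: `MixShufps`, `MixSse2`, `ChainResult` and `run_chain` are hypothetical names, the timings depend on the CPU and compiler flags, and a compiler (Clang in particular) may replace either shuffle with a different instruction, so the generated assembly should be inspected before drawing conclusions.

```cpp
#include <array>
#include <chrono>
#include <cstdint>
#include <emmintrin.h>   // SSE2; also provides _mm_shuffle_ps

// Variant 1: shuffle in the floating-point domain (SHUFPS).
struct MixShufps {
    __m128i operator()(__m128i a, __m128i b) const {
        return _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(a),
                                               _mm_castsi128_ps(b),
                                               _MM_SHUFFLE(3, 2, 1, 0)));
    }
};

// Variant 2: SSE2 integer-only sequence (PSHUFD, PUNPCKLDQ, PSHUFD).
struct MixSse2 {
    __m128i operator()(__m128i a, __m128i b) const {
        return _mm_shuffle_epi32(
            _mm_unpacklo_epi32(a, _mm_shuffle_epi32(b, 0xe)), 0xd8);
    }
};

// Hypothetical result record: average time and the final lane values.
struct ChainResult {
    double ns_per_iter;
    std::array<std::int32_t, 4> lanes;
};

// Runs `iters` dependent steps of: acc = mix(acc, b) + b.
// Each step needs the previous acc, so the loop is bound by latency.
template <typename Mix>
ChainResult run_chain(Mix mix, long iters) {
    const __m128i b = _mm_setr_epi32(4, 6, 7, 8);
    __m128i acc = _mm_setr_epi32(1, 2, 3, 5);

    const auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) {
        acc = _mm_add_epi32(mix(acc, b), b);   // integer add fed by the shuffle
    }
    const auto t1 = std::chrono::steady_clock::now();

    ChainResult r;
    r.ns_per_iter = iters > 0
        ? std::chrono::duration<double, std::nano>(t1 - t0).count() / iters
        : 0.0;
    // Storing the result keeps the loop from being discarded as dead code.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(r.lanes.data()), acc);
    return r;
}
```

A caller would compare `run_chain(MixShufps{}, n).ns_per_iter` with `run_chain(MixSse2{}, n).ns_per_iter` for a large `n`, built with optimization enabled (for example -O2), and check that both `lanes` results are identical. This measures latency only; a throughput-bound loop with independent operations would need a different harness.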

Fortunately, on most Intel/AMD CPUs, there is usually no penalty for using shufps between most integer-vector instructions. Agner says:

For example, I found no delay when mixing PADDD and SHUFPS [on Sandybridge].

Nehalem does have 2 cycles of bypass-delay latency to/from SHUFPS, but even then a single SHUFPS is often still faster than multiple other instructions. Extra instructions have latency too, and they also cost throughput.

The reverse (integer shuffles between FP math instructions) is not as safe:

In Example 8.3a on page 112 of Agner Fog's microarchitecture manual, he shows that using PSHUFD (_mm_shuffle_epi32) instead of SHUFPS (_mm_shuffle_ps) in the floating-point domain causes a bypass delay of four clock cycles. In Example 8.3b he uses SHUFPS to remove the delay (which works in his example).

On Nehalem there are actually five domains. Nehalem seems to be the most affected (the bypass delays did not exist before Nehalem). On Sandy Bridge the delays are less significant, and on Haswell even less so. In fact, Agner reports that on Haswell he found no delays when using either SHUFPS or PSHUFD (see page 140).
