Are there Move (_mm_move_ss) and Set (_mm_set_ss) intrinsics that work for doubles (__m128d)?


Question

Over the years, I have occasionally seen intrinsic functions that take a float parameter and transform it into a __m128 with the following code: __m128 b = _mm_move_ss(m, _mm_set_ss(a));.

For instance:

void MyFunction(float y)
{
    __m128 a = _mm_move_ss(m, _mm_set_ss(y)); //m is __m128
    //do whatever it is with 'a'
}

I wonder if there is a similar way of using _mm_move and _mm_set intrinsics to do the same for doubles (__m128d)?

Solution

Almost every _ss and _ps intrinsic / instruction has a double version with a _sd or _pd suffix. (Scalar Double or Packed Double).
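
In particular, the double-precision equivalent of the sequence in the question would look like this minimal sketch (m is just a placeholder parameter here; function name mine, not from the original answer):

#include <immintrin.h>

// Sketch: the _sd equivalent of the question's float sequence.
// Result: [ y, m[1] ] -- low element replaced by y, high element kept from m.
__m128d MergeScalar(__m128d m, double y) {
  return _mm_move_sd(m, _mm_set_sd(y));
}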

For example, search for "(double" in Intel's intrinsics finder to find intrinsics that take a double as the first argument. Or just figure out what the optimal asm would be, then look up the intrinsics for those instructions in the insn ref manual. Except that the manual doesn't list all the intrinsics for movsd, so searching for an instruction name in the intrinsics finder often works.

re: header files: always just include <immintrin.h>. It includes all Intel SSE/AVX intrinsics.
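
That is, a single include covers everything:

#include <immintrin.h>  // pulls in all the SSE/SSE2/.../AVX intrinsics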


See also ways to put a float into a vector, and the sse tag wiki for links about how to shuffle vectors. (i.e. the tables of shuffle instructions in Agner Fog's optimizing assembly guide)

(see below for a godbolt link to some interesting compiler output)

re: your sequence

Only use _mm_move_ss (or sd) if you actually want to merge two vectors.
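
For instance, a sketch of a genuine two-vector merge (function name mine):

// Take the low double from b and the high double from a: result = [ b[0], a[1] ].
// This is exactly what the register-register form of movsd does.
__m128d merge_low(__m128d a, __m128d b) {
  return _mm_move_sd(a, b);
}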

You don't show how m is defined. Your use of a as the variable name for both the float and the vector implies that the only useful information in the vector is the float arg. The variable-name clash of course means it doesn't compile.

There unfortunately doesn't seem to be any way to just "cast" a float or double into a vector with garbage in the upper 3 elements, like there is for __m128 -> __m256:
__m256 _mm256_castps128_ps256 (__m128 a). I posted a new question about this limitation with intrinsics: How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?
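
For comparison, the 128-to-256 cast that does exist leaves the upper lane explicitly undefined, so it typically costs zero instructions (a sketch; function name mine):

// Widening cast: the upper 128 bits of the result are undefined garbage,
// so compilers are free to emit no instruction at all for this.
__m256 widen(__m128 lo) {
  return _mm256_castps128_ps256(lo);
}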

I tried using _mm_undefined_pd() to achieve this, hoping it would clue the compiler in that it can just leave the incoming high garbage in place, as in

// don't use this, it doesn't make better code
__m128d double_to_vec_highgarbage(double x) {
  __m128d undef = _mm_undefined_pd();
  __m128d x_zeroupper = _mm_set_sd(x);
  return _mm_move_sd(undef, x_zeroupper);
}

but clang3.8 compiles it to

    # clang3.8 -O3 -march=core2
    movq    xmm0, xmm0              # xmm0 = xmm0[0],zero
    ret

So no advantage, still zeroing the upper half instead of compiling it to just a ret. gcc actually makes pretty bad code:

double_to_vec_highgarbage:  # gcc5.3 -march=nehalem
    movsd   QWORD PTR [rsp-16], xmm0      # %sfp, x
    movsd   xmm1, QWORD PTR [rsp-16]      # D.26885, %sfp
    pxor    xmm0, xmm0      # __Y
    movsd   xmm0, xmm1    # tmp93, D.26885
    ret


_mm_set_sd appears to be the best way to turn a scalar into a vector.

__m128d double_to_vec(double x) {
  return _mm_set_sd(x);
}

clang compiles it to a movq xmm0,xmm0; gcc compiles it to a store/reload with -march=generic.
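
As a usage sketch (my example, not from the original answer): once the scalar is in a vector, you can feed it to other _sd intrinsics and pull the result back out with _mm_cvtsd_f64:

// Sketch: compute sqrt(x) through the vector unit, then extract the scalar.
double sqrt_via_vec(double x) {
  __m128d v = _mm_set_sd(x);      // [ x, 0.0 ]
  __m128d r = _mm_sqrt_sd(v, v);  // low = sqrt(v[0]), high copied from first arg
  return _mm_cvtsd_f64(r);        // extract the low double
}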


Other interesting compiler outputs from the float and double versions on the Godbolt compiler explorer

float_to_vec:   # gcc 5.3 -O3 -march=core2
    movd    eax, xmm0       # x, x
    movd    xmm0, eax       # D.26867, x
    ret

float_to_vec:   # gcc5.3 -O3 -march=nehalem
    insertps        xmm0, xmm0, 0xe # D.26867, x
    ret

double_to_vec:    # gcc5.3 -O3 -march=nehalem.  It could have used movq or insertps instead of this longer-latency store-forwarding round trip
    movsd   QWORD PTR [rsp-16], xmm0      # %sfp, x
    movsd   xmm0, QWORD PTR [rsp-16]      # D.26881, %sfp
    ret

float_to_vec:   # clang3.8 -O3 -march=core2 or generic (no -march)
    xorps   xmm1, xmm1
    movss   xmm1, xmm0              # xmm1 = xmm0[0],xmm1[1,2,3]
    movaps  xmm0, xmm1
    ret

double_to_vec:  # clang3.8 -O3 -march=core2, nehalem, or generic (no -march)
    movq    xmm0, xmm0              # xmm0 = xmm0[0],zero
    ret


float_to_vec:    # clang3.8 -O3 -march=nehalem
    xorps   xmm1, xmm1
    blendps xmm0, xmm1, 14          # xmm0 = xmm0[0],xmm1[1,2,3]
    ret

So both clang and gcc use different strategies for float vs. double, even when they could use the same strategy.

Using integer operations like movq between floating-point operations causes extra bypass delay latency. Using insertps to zero the upper elements of the input register should be the best strategy for float or double, so all compilers should use that when SSE4.1 is available. xorps + blend is good, too, and can run on more ports than insertps. The store/reload is probably the worst, unless we're bottlenecked on ALU throughput, and latency doesn't matter.
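
Written out with intrinsics, clang's xorps + blendps strategy would look something like this sketch (SSE4.1 required; function name mine):

// Keep element 0 of x, take elements 1..3 from an all-zero vector.
// Immediate bit i set => take element i from the second operand (x).
__m128 float_to_vec_blend(__m128 x) {
  return _mm_blend_ps(_mm_setzero_ps(), x, 0x1);
}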
