Intel AVX: 256-bit version of dot product for double-precision floating-point variables


Problem description


The Intel Advanced Vector Extensions (AVX) offer no dot product instruction in the 256-bit version (ymm registers) for double-precision floating-point variables. The "Why?" question has been treated very briefly in another forum (here) and on SO (here). The question I am facing is: how can this missing instruction be replaced efficiently with other AVX instructions?

The dot product in 256-bit version exists for single precision FP variables (ref here):

 __m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);

The idea is to find an efficient equivalent for this missing instruction:

 __m256d _mm256_dp_pd(__m256d m1, __m256d m2, const int mask);

Thanks for your suggestions.

Edit:

To be more specific, the code I would like to convert from __m128 (4 floats) to __m256d (4 doubles) uses the following instructions:

   __m128 val0 = ...; // 4 float values
   __m128 val1 = ...; //
   __m128 val2 = ...; //
   __m128 val3 = ...; //
   __m128 val4 = ...; //

   __m128 res = _mm_or_ps( _mm_dp_ps(val1,  val0,   0xF1), 
                _mm_or_ps( _mm_dp_ps(val2,  val0,   0xF2), 
                _mm_or_ps( _mm_dp_ps(val3,  val0,   0xF4), 
                           _mm_dp_ps(val4,  val0,   0xF8) )));

The result of this code is a __m128 vector of 4 floats containing the dot products of val1 and val0, val2 and val0, val3 and val0, and val4 and val0.

Maybe this can give some hints for the suggestions?
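To make the role of the masks in the snippet above explicit, here is a minimal scalar sketch of what `_mm_dp_ps` computes for a given mask. The function name `dp_ps_model` is hypothetical and not part of any library. The high four mask bits choose which element products enter the sum, and the low four bits choose which output lanes receive it, so 0xF1, 0xF2, 0xF4 and 0xF8 each place a full 4-element dot product in a different lane, and OR-ing the results packs the four dot products into one vector.

```c
/* Scalar model of _mm_dp_ps(a, b, mask). The name dp_ps_model is
 * hypothetical; it only illustrates the mask semantics. The hardware
 * instruction may add the products in a different order. */
void dp_ps_model(const float a[4], const float b[4], int mask, float out[4])
{
    float sum = 0.0f;

    /* Bits 4..7: include the product a[i] * b[i] in the sum if bit 4+i is set. */
    for (int i = 0; i < 4; ++i) {
        if (mask & (0x10 << i)) {
            sum += a[i] * b[i];
        }
    }

    /* Bits 0..3: write the sum to lane i if bit i is set, otherwise write 0. */
    for (int i = 0; i < 4; ++i) {
        out[i] = (mask & (1 << i)) ? sum : 0.0f;
    }
}
```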

Solution

I would use one multiplication of four doubles, then an hadd (which unfortunately only adds adjacent pairs of doubles within the lower and the upper 128-bit half), extract the upper half (a shuffle should work equally well, maybe faster) and add it to the lower half.

The result is in the low 64 bits of dotproduct.

__m256d xy = _mm256_mul_pd( x, y );
__m256d temp = _mm256_hadd_pd( xy, xy );
__m128d hi128 = _mm256_extractf128_pd( temp, 1 );
__m128d dotproduct = _mm_add_pd( _mm256_castpd256_pd128( temp ), hi128 );
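
For reference, the four lines above can be wrapped into a self-contained function. This is only a sketch under a few assumptions: the name `dot4_avx` is hypothetical, the inputs are plain arrays of four doubles loaded with unaligned loads, and the `target("avx")` attribute is a GCC/Clang extension used here so the file compiles without `-mavx` (with other compilers, drop the attribute and enable AVX through compiler options).

```c
#include <immintrin.h>

/* Hypothetical helper: dot product of two arrays of 4 doubles using AVX.
 * The target attribute is GCC/Clang specific (assumption). */
__attribute__((target("avx")))
double dot4_avx(const double *x, const double *y)
{
    __m256d vx = _mm256_loadu_pd(x);            /* unaligned loads of 4 doubles */
    __m256d vy = _mm256_loadu_pd(y);

    __m256d xy = _mm256_mul_pd(vx, vy);         /* x0*y0  x1*y1  x2*y2  x3*y3 */

    /* hadd works per 128-bit half: lanes 0,1 hold xy0+xy1, lanes 2,3 hold xy2+xy3 */
    __m256d temp = _mm256_hadd_pd(xy, xy);

    __m128d lo128 = _mm256_castpd256_pd128(temp);   /* lower half, no instruction needed */
    __m128d hi128 = _mm256_extractf128_pd(temp, 1); /* upper half */

    __m128d sum = _mm_add_pd(lo128, hi128);     /* low 64 bits: the full dot product */
    return _mm_cvtsd_f64(sum);                  /* extract the low double */
}
```

Calling it with x = {1, 2, 3, 4} and y = {5, 6, 7, 8} gives 1*5 + 2*6 + 3*7 + 4*8 = 70.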

Edit:
Following an idea from Norbert P., I extended this version to compute four dot products at once.

__m256d xy0 = _mm256_mul_pd( x[0], y[0] );
__m256d xy1 = _mm256_mul_pd( x[1], y[1] );
__m256d xy2 = _mm256_mul_pd( x[2], y[2] );
__m256d xy3 = _mm256_mul_pd( x[3], y[3] );

// low to high: xy00+xy01 xy10+xy11 xy02+xy03 xy12+xy13
__m256d temp01 = _mm256_hadd_pd( xy0, xy1 );   

// low to high: xy20+xy21 xy30+xy31 xy22+xy23 xy32+xy33
__m256d temp23 = _mm256_hadd_pd( xy2, xy3 );

// low to high: xy02+xy03 xy12+xy13 xy20+xy21 xy30+xy31
__m256d swapped = _mm256_permute2f128_pd( temp01, temp23, 0x21 );

// low to high: xy00+xy01 xy10+xy11 xy22+xy23 xy32+xy33
__m256d blended = _mm256_blend_pd(temp01, temp23, 0b1100);

__m256d dotproduct = _mm256_add_pd( swapped, blended );
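
The four-at-once version can be checked with a small self-contained sketch. It rests on assumptions that are not in the answer: the name `dot4x4_avx` is hypothetical, the four x vectors and four y vectors are stored back to back in flat arrays of 16 doubles, and the `target("avx")` attribute is a GCC/Clang extension. The blend mask is written as 0xC, which is the same value as 0b1100 but does not rely on binary literals.

```c
#include <immintrin.h>

/* Hypothetical helper: out[i] = dot(x[4*i .. 4*i+3], y[4*i .. 4*i+3]) for i = 0..3.
 * x and y are flat arrays of 16 doubles (assumed layout). */
__attribute__((target("avx")))
void dot4x4_avx(const double *x, const double *y, double *out)
{
    __m256d xy0 = _mm256_mul_pd(_mm256_loadu_pd(x + 0),  _mm256_loadu_pd(y + 0));
    __m256d xy1 = _mm256_mul_pd(_mm256_loadu_pd(x + 4),  _mm256_loadu_pd(y + 4));
    __m256d xy2 = _mm256_mul_pd(_mm256_loadu_pd(x + 8),  _mm256_loadu_pd(y + 8));
    __m256d xy3 = _mm256_mul_pd(_mm256_loadu_pd(x + 12), _mm256_loadu_pd(y + 12));

    /* low to high: xy00+xy01  xy10+xy11  xy02+xy03  xy12+xy13 */
    __m256d temp01 = _mm256_hadd_pd(xy0, xy1);

    /* low to high: xy20+xy21  xy30+xy31  xy22+xy23  xy32+xy33 */
    __m256d temp23 = _mm256_hadd_pd(xy2, xy3);

    /* upper half of temp01 followed by lower half of temp23 */
    __m256d swapped = _mm256_permute2f128_pd(temp01, temp23, 0x21);

    /* lower half of temp01 followed by upper half of temp23 */
    __m256d blended = _mm256_blend_pd(temp01, temp23, 0xC);

    /* lane i now holds the complete dot product of pair i */
    _mm256_storeu_pd(out, _mm256_add_pd(swapped, blended));
}
```

Each lane of the result receives both partial sums of its own pair, so no further reduction is needed after the final add.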
