4 horizontal double-precision sums in one go with AVX

Problem description

The problem can be described as follows.

Input

__m256d a, b, c, d

Output

__m256d s = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3], 
             c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}
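
For reference, the desired output can be written as a plain scalar routine. This is only a sketch to pin down the specification: the name `hsum4_ref` and the array-based signature are assumptions made for illustration, not part of the original post.

```c
#include <stddef.h>

/* Hypothetical scalar reference (not from the original post).
   Each input holds 4 doubles; out[i] receives the horizontal sum of
   the i-th input, i.e. out = {sum(a), sum(b), sum(c), sum(d)}. */
static void hsum4_ref(const double a[4], const double b[4],
                      const double c[4], const double d[4],
                      double out[4])
{
    const double *in[4] = {a, b, c, d};
    for (size_t i = 0; i < 4; ++i) {
        /* Evaluated left to right: ((v0 + v1) + v2) + v3.
           A SIMD version may associate the additions differently,
           so results can differ in the last bit for inexact inputs. */
        out[i] = in[i][0] + in[i][1] + in[i][2] + in[i][3];
    }
}
```

A vectorized version can be checked against this routine on small integer-valued inputs, where every addition is exact.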

What I have done so far

It seemed easy enough: two VHADD with some shuffling in-between but in fact combining all permutations featured by AVX can't generate the very permutation needed to achieve that goal. Let me explain:

VHADD x, a, b => x = {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
VHADD y, c, d => y = {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}

Were I able to permute x and y in the same manner to get

x1 = {a[0]+a[1], a[2]+a[3], c[0]+c[1], c[2]+c[3]}
y1 = {b[0]+b[1], b[2]+b[3], d[0]+d[1], d[2]+d[3]}

then

VHADD s, x1, y1 => s = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3], 
                        c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}

which is the result I want.

Thus I just need to find how to perform

x,y => {x[0], x[2], y[0], y[2]}, {x[1], x[3], y[1], y[3]}

Unfortunately I came to the conclusion that this is provably impossible using any combination of VSHUFPD, VBLENDPD, VPERMILPD, VPERM2F128, VUNPCKHPD, VUNPCKLPD. The crux of the matter is that it is impossible to swap u[1] and u[2] in an instance u of __m256d.

Question

Is this really a dead end? Or have I missed a permutation instruction?

Recommended answer

VHADD instructions are meant to be followed by regular VADD. The following code should give you what you want:

// {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
__m256d sumab = _mm256_hadd_pd(a, b);
// {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}
__m256d sumcd = _mm256_hadd_pd(c, d);

// {a[0]+a[1], b[0]+b[1], c[2]+c[3], d[2]+d[3]}
__m256d blend = _mm256_blend_pd(sumab, sumcd, 0b1100);
// {a[2]+a[3], b[2]+b[3], c[0]+c[1], d[0]+d[1]}
__m256d perm = _mm256_permute2f128_pd(sumab, sumcd, 0x21);

__m256d sum =  _mm256_add_pd(perm, blend);

This gives the result in 5 instructions. I hope I got the constants right.
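
To check the constants, the five instructions above can be wrapped in a small function that loads from and stores to plain `double` arrays. This is a minimal sketch under stated assumptions: the name `hsum4_avx` and the pointer-based signature are made up for illustration, the `target("avx")` attribute assumes GCC or Clang (it avoids needing `-mavx` for the whole file), and the binary literals are written in hex so the code stays valid standard C.

```c
#include <immintrin.h>

/* Hypothetical wrapper (not from the original answer).
   Each pointer refers to 4 doubles; unaligned loads/stores are used
   so no alignment is assumed. Requires a CPU with AVX at run time. */
__attribute__((target("avx")))
static void hsum4_avx(const double *a, const double *b,
                      const double *c, const double *d, double *out)
{
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    __m256d vc = _mm256_loadu_pd(c);
    __m256d vd = _mm256_loadu_pd(d);

    /* {a0+a1, b0+b1, a2+a3, b2+b3} */
    __m256d sumab = _mm256_hadd_pd(va, vb);
    /* {c0+c1, d0+d1, c2+c3, d2+d3} */
    __m256d sumcd = _mm256_hadd_pd(vc, vd);

    /* 0xC == 0b1100: elements 2,3 come from sumcd
       {a0+a1, b0+b1, c2+c3, d2+d3} */
    __m256d blend = _mm256_blend_pd(sumab, sumcd, 0xC);
    /* 0x21: low half = high half of sumab, high half = low half of sumcd
       {a2+a3, b2+b3, c0+c1, d0+d1} */
    __m256d perm = _mm256_permute2f128_pd(sumab, sumcd, 0x21);

    /* {sum(a), sum(b), sum(c), sum(d)} */
    _mm256_storeu_pd(out, _mm256_add_pd(perm, blend));
}
```

Calling it with `a = {1, 2, 3, 4}` should leave `10` in `out[0]`, and likewise for the other three inputs, which confirms that both immediates select the intended halves.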

The permutation that you proposed is certainly possible to accomplish, but it takes multiple instructions. Sorry that I'm not answering that part of your question.

I couldn't resist, here's the complete permutation. (Again, I did my best to get the constants right.) You can see that swapping u[1] and u[2] is possible, it just takes a bit of work. Crossing the 128-bit lane boundary is difficult in first-generation AVX. I also want to say that VADD is preferable to VHADD because VADD has twice the throughput, even though it's doing the same number of additions.

// {x[0],x[1],x[2],x[3]}
__m256d x;

// {x[1],x[0],x[3],x[2]}
__m256d xswap = _mm256_permute_pd(x, 0b0101);

// {x[3],x[2],x[1],x[0]}
__m256d xflip128 = _mm256_permute2f128_pd(xswap, xswap, 0x01);

// {x[0],x[2],x[1],x[3]} -- not impossible to swap x[1] and x[2]
__m256d xblend = _mm256_blend_pd(x, xflip128, 0b0110);

// repeat the same for y
// {y[0],y[2],y[1],y[3]}
__m256d yblend;

// {x[0],x[2],y[0],y[2]}
__m256d x02y02 = _mm256_permute2f128_pd(xblend, yblend, 0x20);

// {x[1],x[3],y[1],y[3]}
__m256d x13y13 = _mm256_permute2f128_pd(xblend, yblend, 0x31);
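
Putting the pieces together, the plan from the question (two VHADD, the permutation, then a final VHADD) could be sketched as follows. The helper names `swap_middle_pd` and `hsum4_hadd_only` and the pointer-based signature are assumptions for illustration; the `target("avx")` attribute assumes GCC or Clang, and the immediates are the answer's constants written in hex.

```c
#include <immintrin.h>

/* Hypothetical helper: {x0,x1,x2,x3} -> {x0,x2,x1,x3} */
__attribute__((target("avx")))
static __m256d swap_middle_pd(__m256d x)
{
    /* 0x5 == 0b0101: swap within each 128-bit lane -> {x1,x0,x3,x2} */
    __m256d xswap = _mm256_permute_pd(x, 0x5);
    /* swap the two 128-bit lanes -> {x3,x2,x1,x0} */
    __m256d xflip128 = _mm256_permute2f128_pd(xswap, xswap, 0x01);
    /* 0x6 == 0b0110: take elements 1,2 from xflip128 -> {x0,x2,x1,x3} */
    return _mm256_blend_pd(x, xflip128, 0x6);
}

/* Hypothetical wrapper: each pointer refers to 4 doubles. */
__attribute__((target("avx")))
static void hsum4_hadd_only(const double *a, const double *b,
                            const double *c, const double *d, double *out)
{
    /* x = {a0+a1, b0+b1, a2+a3, b2+b3} */
    __m256d x = _mm256_hadd_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b));
    /* y = {c0+c1, d0+d1, c2+c3, d2+d3} */
    __m256d y = _mm256_hadd_pd(_mm256_loadu_pd(c), _mm256_loadu_pd(d));

    __m256d xblend = swap_middle_pd(x);  /* {x0,x2,x1,x3} */
    __m256d yblend = swap_middle_pd(y);  /* {y0,y2,y1,y3} */

    /* x1 = {x0,x2,y0,y2}, y1 = {x1,x3,y1,y3} */
    __m256d x1 = _mm256_permute2f128_pd(xblend, yblend, 0x20);
    __m256d y1 = _mm256_permute2f128_pd(xblend, yblend, 0x31);

    /* {sum(a), sum(b), sum(c), sum(d)} */
    _mm256_storeu_pd(out, _mm256_hadd_pd(x1, y1));
}
```

This route uses 11 arithmetic and shuffle instructions (three VHADD plus eight permutes and blends) against 5 for the blend-and-add version, so it mainly serves to show that the permutation is achievable.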
