4 horizontal double-precision sums in one go with AVX


Problem description

The problem can be described as follows.

Input

__m256d a, b, c, d

Output

__m256d s = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3], 
             c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}
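
As a baseline for checking any vectorized version, the specification above can be written as plain scalar C. This is only a sketch: the name hsum4_scalar and the pointer-based signature are assumptions made for illustration and are not part of the original question. The additions are grouped pairwise, (r[0]+r[1]) + (r[2]+r[3]), the same order a two-stage horizontal add would use, so results can be compared exactly.

```c
/* Hypothetical scalar reference (name and signature are illustrative).
   Each of a, b, c, d points to 4 doubles; s receives the 4 row sums. */
static void hsum4_scalar(const double *a, const double *b,
                         const double *c, const double *d, double *s)
{
    /* Pairwise grouping mirrors a two-stage horizontal add. */
    s[0] = (a[0] + a[1]) + (a[2] + a[3]);
    s[1] = (b[0] + b[1]) + (b[2] + b[3]);
    s[2] = (c[0] + c[1]) + (c[2] + c[3]);
    s[3] = (d[0] + d[1]) + (d[2] + d[3]);
}
```

Calling it with four 4-element arrays and a 4-element output buffer gives the values that s is expected to hold.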

What I have done so far

It seemed easy enough: two VHADDs with some shuffling in between. In fact, though, combining all the permutations AVX offers does not seem to produce the one needed to reach that goal. Let me explain:

VHADD x, a, b => x = {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
VHADD y, c, d => y = {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}

If I could permute x and y in the same manner to get

x1 = {a[0]+a[1], a[2]+a[3], c[0]+c[1], c[2]+c[3]}
y1 = {b[0]+b[1], b[2]+b[3], d[0]+d[1], d[2]+d[3]}

then

VHADD s, x1, y1 => s = {a[0]+a[1]+a[2]+a[3], b[0]+b[1]+b[2]+b[3],
                        c[0]+c[1]+c[2]+c[3], d[0]+d[1]+d[2]+d[3]}

which is the result I wanted.

Thus I just need to find how to perform

x,y => {x[0], x[2], y[0], y[2]}, {x[1], x[3], y[1], y[3]}

Unfortunately, I came to the conclusion that this is provably impossible using any combination of VSHUFPD, VBLENDPD, VPERMILPD, VPERM2F128, VUNPCKHPD, VUNPCKLPD. The crux of the matter is that it is impossible to swap u[1] and u[2] in an instance u of __m256d.

Question

Is this really a dead end? Or have I missed a permutation instruction?

Recommended answer

VHADD instructions are meant to be followed by a regular VADD. The following code should give you what you want:

// {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]}
__m256d sumab = _mm256_hadd_pd(a, b);
// {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]}
__m256d sumcd = _mm256_hadd_pd(c, d);

// {a[0]+a[1], b[0]+b[1], c[2]+c[3], d[2]+d[3]}
__m256d blend = _mm256_blend_pd(sumab, sumcd, 0b1100);
// {a[2]+a[3], b[2]+b[3], c[0]+c[1], d[0]+d[1]}
__m256d perm = _mm256_permute2f128_pd(sumab, sumcd, 0x21);

__m256d sum = _mm256_add_pd(perm, blend);

This gives the result in 5 instructions. I hope I got the constants right.
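
One way to check the constants is to wrap the five instructions above in a small function and run it on known inputs. The following is a minimal sketch, not part of the original answer: the names hsum4_avx and TARGET_AVX, the pointer-based signature, and the unaligned loads and stores are all assumptions made for illustration. The immediates are written in hexadecimal (0xC instead of 0b1100) because binary literals are a compiler extension in C before C23.

```c
#include <immintrin.h>

/* Illustrative helper macro (an assumption, not from the answer):
   lets GCC/Clang build this one function for AVX without -mavx.
   With other compilers, enable AVX in the build settings instead. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical wrapper: a, b, c, d each point to 4 doubles;
   out[0..3] receives the sums of a, b, c and d respectively. */
TARGET_AVX
static void hsum4_avx(const double *a, const double *b,
                      const double *c, const double *d, double *out)
{
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    __m256d vc = _mm256_loadu_pd(c);
    __m256d vd = _mm256_loadu_pd(d);

    /* {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]} */
    __m256d sumab = _mm256_hadd_pd(va, vb);
    /* {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]} */
    __m256d sumcd = _mm256_hadd_pd(vc, vd);

    /* Low 128 bits from sumab, high 128 bits from sumcd. */
    __m256d blend = _mm256_blend_pd(sumab, sumcd, 0xC);
    /* High 128 bits of sumab, then low 128 bits of sumcd. */
    __m256d perm = _mm256_permute2f128_pd(sumab, sumcd, 0x21);

    /* Vertical add completes all four horizontal sums. */
    _mm256_storeu_pd(out, _mm256_add_pd(perm, blend));
}
```

With a = {1, 2, 3, 4}, b = {10, 20, 30, 40}, c = {0.5, 0.5, 0.5, 0.5} and d = {-1, 1, -2, 2}, the output buffer should hold the four row sums in the order a, b, c, d. The code needs a CPU with AVX support at run time.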

The permutation that you proposed is certainly possible to accomplish, but it takes multiple instructions. Sorry that I'm not answering that part of your question.

I couldn't resist, so here's the complete permutation. (Again, I did my best to get the constants right.) You can see that swapping u[1] and u[2] is possible; it just takes a bit of work. Crossing the 128-bit barrier is difficult in first-generation AVX. I also want to say that VADD is preferable to VHADD because VADD has twice the throughput, even though it's doing the same number of additions.

// {x[0],x[1],x[2],x[3]}
__m256d x;

// {x[1],x[0],x[3],x[2]}
__m256d xswap = _mm256_permute_pd(x, 0b0101);

// {x[3],x[2],x[1],x[0]}
__m256d xflip128 = _mm256_permute2f128_pd(xswap, xswap, 0x01);

// {x[0],x[2],x[1],x[3]} -- not impossible to swap x[1] and x[2]
__m256d xblend = _mm256_blend_pd(x, xflip128, 0b0110);

// repeat the same for y
// {y[0],y[2],y[1],y[3]}
__m256d yblend;

// {x[0],x[2],y[0],y[2]}
__m256d x02y02 = _mm256_permute2f128_pd(xblend, yblend, 0x20);

// {x[1],x[3],y[1],y[3]}
__m256d x13y13 = _mm256_permute2f128_pd(xblend, yblend, 0x31);
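
To see the permutation route end to end, the steps above can be combined with the two initial VHADDs and the final VHADD from the question. This is a sketch under stated assumptions: swap_middle, hsum4_permute and TARGET_AVX are hypothetical names, and the pointer-based interface with unaligned loads and stores is chosen only to keep the example self-contained. Immediates are in hexadecimal to avoid relying on binary literals.

```c
#include <immintrin.h>

/* Illustrative helper macro (an assumption): builds the marked
   functions for AVX under GCC/Clang without needing -mavx. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

/* Hypothetical helper: {x[0],x[1],x[2],x[3]} -> {x[0],x[2],x[1],x[3]} */
TARGET_AVX
static __m256d swap_middle(__m256d x)
{
    /* {x[1],x[0],x[3],x[2]} */
    __m256d xswap = _mm256_permute_pd(x, 0x5);
    /* {x[3],x[2],x[1],x[0]} */
    __m256d xflip128 = _mm256_permute2f128_pd(xswap, xswap, 0x01);
    /* Take elements 1 and 2 from xflip128, 0 and 3 from x. */
    return _mm256_blend_pd(x, xflip128, 0x6);
}

/* Hypothetical wrapper: out[0..3] = sums of a, b, c, d (4 doubles each). */
TARGET_AVX
static void hsum4_permute(const double *a, const double *b,
                          const double *c, const double *d, double *out)
{
    /* x = {a[0]+a[1], b[0]+b[1], a[2]+a[3], b[2]+b[3]} */
    __m256d x = _mm256_hadd_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b));
    /* y = {c[0]+c[1], d[0]+d[1], c[2]+c[3], d[2]+d[3]} */
    __m256d y = _mm256_hadd_pd(_mm256_loadu_pd(c), _mm256_loadu_pd(d));

    __m256d xblend = swap_middle(x);
    __m256d yblend = swap_middle(y);

    /* {x[0],x[2],y[0],y[2]} and {x[1],x[3],y[1],y[3]} */
    __m256d x02y02 = _mm256_permute2f128_pd(xblend, yblend, 0x20);
    __m256d x13y13 = _mm256_permute2f128_pd(xblend, yblend, 0x31);

    /* The second horizontal add yields the four full sums. */
    _mm256_storeu_pd(out, _mm256_hadd_pd(x02y02, x13y13));
}
```

This version spends eight shuffle-type instructions plus three VHADDs, compared with five instructions for the blend-and-add approach, so it is mainly useful as a demonstration that the swap can be done.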
