How to add the values in a vector to each other

Question

In my code I numerically evaluate the integral of

y=x^2-4x+6

I used SSE, which lets me operate on 4 values at a time. I wrote a program that evaluates this integral over the range 0 to 5, with the sample points split into four 4-element vectors n1, n2, n3, n4.

.data
n1: .float 0.3125,0.625,0.9375,1.25
n2: .float 1.5625,1.875,2.1875,2.5
n3: .float 2.8125,3.12500,3.4375,3.75
n4: .float 4.0625,4.37500,4.6875,5
szostka: .float 6,6,6,6
czworka: .float 4,4,4,4
.text
.global main
main:  
        movups (n1),%xmm0

        mulps %xmm0,%xmm0
        movups (szostka),%xmm2
        addps %xmm2,%xmm0
        movups (n1),%xmm1
        movups (czworka),%xmm2
        mulps %xmm2,%xmm1
        subps %xmm1,%xmm0
        movups %xmm0,%xmm7

        movups (n2),%xmm0

        mulps %xmm0,%xmm0
        movups (szostka),%xmm2
        addps %xmm2,%xmm0
        movups (n2),%xmm1       # x values of this block, for the 4x term
        movups (czworka),%xmm2
        mulps %xmm2,%xmm1
        subps %xmm1,%xmm0
        movups %xmm0,%xmm6

        movups (n3),%xmm0

        mulps %xmm0,%xmm0
        movups (szostka),%xmm2
        addps %xmm2,%xmm0
        movups (n3),%xmm1       # x values of this block, for the 4x term
        movups (czworka),%xmm2
        mulps %xmm2,%xmm1
        subps %xmm1,%xmm0
        movups %xmm0,%xmm5

        movups (n4),%xmm0

        mulps %xmm0,%xmm0
        movups (szostka),%xmm2
        addps %xmm2,%xmm0
        movups (n4),%xmm1       # x values of this block, for the 4x term
        movups (czworka),%xmm2
        mulps %xmm2,%xmm1
        subps %xmm1,%xmm0
        movups %xmm0,%xmm4

        mov $1,%eax
        mov $0,%ebx
        int $0x80 

In the end, I have 4 vectors in registers xmm7, xmm6, xmm5, xmm4. To finish the integral, I need to add the vectors to each other (which is easy) and then also add the values inside the resulting vector to each other.
How should I do this?

Recommended answer

As Paul R said in a comment, you can use haddps for horizontal operations within a vector, at the end.
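As a rough illustration of that final horizontal step, here is a minimal C intrinsics sketch. It is not the asker's code: the names hsum_ps and sum_four_vectors are hypothetical, and instead of haddps (which needs SSE3) it uses SSE1 shuffles that produce the same sum, so it builds without extra compiler flags on x86-64.

```c
#include <xmmintrin.h>  /* SSE1: __m128, _mm_add_ps, _mm_movehl_ps, _mm_shuffle_ps */

/* Horizontal sum of the four floats in v, using SSE1 only.
   With SSE3, two haddps steps (_mm_hadd_ps from <pmmintrin.h>)
   would compute the same total. */
static float hsum_ps(__m128 v)
{
    /* hi = (v2, v3, v2, v3): upper pair copied into the lower half */
    __m128 hi = _mm_movehl_ps(v, v);
    /* lanes 0 and 1 of sum2 hold (v0 + v2, v1 + v3) */
    __m128 sum2 = _mm_add_ps(v, hi);
    /* broadcast lane 1 so it can be added to lane 0 */
    __m128 odd = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    /* lane 0 = (v0 + v2) + (v1 + v3) */
    __m128 sum1 = _mm_add_ss(sum2, odd);
    return _mm_cvtss_f32(sum1);
}

/* Add the four result vectors element-wise first (the "easy" part),
   then reduce the single remaining vector once. */
static float sum_four_vectors(__m128 a, __m128 b, __m128 c, __m128 d)
{
    __m128 acc = _mm_add_ps(_mm_add_ps(a, b), _mm_add_ps(c, d));
    return hsum_ps(acc);
}
```

Doing the vertical adds first means only one horizontal reduction is needed, which is why the horizontal step belongs at the very end.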

Your code looks inefficient. If you're going to fully unroll instead of using a loop and an accumulator, compute each block directly in its own register in the first place, instead of having a movups %xmm0,%xmmX at the end of every block.

Also, keep (szostka) and (czworka) in registers across iterations. Don't reload them every time. Similarly, replace movups (n1),%xmm1 with movups %xmm0, %xmm1 (before you square %xmm0). On IvyBridge and later, the register-renaming stage handles reg-reg moves, and they happen with zero latency.
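The loop-and-accumulator shape described above might look like the following C intrinsics sketch. The function name accumulate_poly and its signature are hypothetical, and it assumes the number of sample points is a multiple of 4. The two constants are created once outside the loop, and each x vector is loaded once and reused for both the x*x and 4*x terms.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Adds f(x) = x*x - 4*x + 6 for every x in xs[0..n) into four
   lane-wise partial sums written to lanes[0..3].
   Assumes n is a multiple of 4 (hypothetical helper). */
static void accumulate_poly(const float *xs, size_t n, float lanes[4])
{
    /* Constants set up once, outside the loop (like szostka / czworka). */
    const __m128 four = _mm_set1_ps(4.0f);
    const __m128 six = _mm_set1_ps(6.0f);
    __m128 acc = _mm_setzero_ps();

    for (size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(xs + i);   /* one unaligned load per block */
        __m128 sq = _mm_mul_ps(x, x);      /* x*x */
        __m128 lin = _mm_mul_ps(x, four);  /* 4*x, reusing the same x */
        __m128 y = _mm_sub_ps(_mm_add_ps(sq, six), lin);
        acc = _mm_add_ps(acc, y);          /* running vector sum */
    }
    _mm_storeu_ps(lanes, acc);
}
```

Called with the 16 sample points from n1 to n4, the four lane sums still have to be added together and multiplied by the step width 0.3125 to get the estimate of the integral.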

If you did need to load (szostka) every time, it would be better to use addps with a memory operand, instead of a separate load and add. Micro-fusion could keep that operation as a single uop. Note that in the non-VEX SSE encoding, a memory operand to addps must be 16-byte aligned.

Check out http://agner.org/optimize/ for docs on how to optimize assembly. You might find it more useful to use intrinsics, letting the compiler take care of small details like register allocation, instead of writing asm directly.
