How to add values from vector to each other
Question
In my code I evaluate the integral of
y=x^2-4x+6
I used SSE, which lets me operate on 4 values at a time. I wrote a program that evaluates this integral over the range 0 to 5, with the sample points divided into four 4-element vectors n1, n2, n3, n4.
.data
n1: .float 0.3125,0.625,0.9375,1.25
n2: .float 1.5625,1.875,2.1875,2.5
n3: .float 2.8125,3.12500,3.4375,3.75
n4: .float 4.0625,4.37500,4.6875,5
szostka: .float 6,6,6,6
czworka: .float 4,4,4,4
.text
.global main
main:
movups (n1),%xmm0
mulps %xmm0,%xmm0
movups (szostka),%xmm2
addps %xmm2,%xmm0
movups (n1),%xmm1
movups (czworka),%xmm2
mulps %xmm2,%xmm1
subps %xmm1,%xmm0
movups %xmm0,%xmm7
movups (n2),%xmm0
mulps %xmm0,%xmm0
movups (szostka),%xmm2
addps %xmm2,%xmm0
movups (n2),%xmm1
movups (czworka),%xmm2
mulps %xmm2,%xmm1
subps %xmm1,%xmm0
movups %xmm0,%xmm6
movups (n3),%xmm0
mulps %xmm0,%xmm0
movups (szostka),%xmm2
addps %xmm2,%xmm0
movups (n3),%xmm1
movups (czworka),%xmm2
mulps %xmm2,%xmm1
subps %xmm1,%xmm0
movups %xmm0,%xmm5
movups (n4),%xmm0
mulps %xmm0,%xmm0
movups (szostka),%xmm2
addps %xmm2,%xmm0
movups (n4),%xmm1
movups (czworka),%xmm2
mulps %xmm2,%xmm1
subps %xmm1,%xmm0
movups %xmm0,%xmm4
mov $1,%eax
mov $0,%ebx
int $0x80
In the end, I have 4 vectors in registers xmm7, xmm6, xmm5, xmm4. To finish the integral, I need to add the vectors to each other (which is easy) and then add the values within the resulting vector to each other.
How should I do this?
Recommended answer
As Paul R said in a comment, you can use haddps (an SSE3 instruction) for horizontal ops within a vector, at the end.
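A minimal sketch of that final reduction, assuming the four partial results are still in xmm4 through xmm7 as in the question and that the CPU supports SSE3. Using xmm7 as the destination is an arbitrary choice for illustration:

```asm
# Vertical adds first: combine the four result vectors into one.
addps  %xmm6,%xmm7       # xmm7 = xmm7 + xmm6
addps  %xmm4,%xmm5       # xmm5 = xmm5 + xmm4
addps  %xmm5,%xmm7       # xmm7 = element-wise sum of all four vectors

# Horizontal adds: call the elements of xmm7 (a, b, c, d).
haddps %xmm7,%xmm7       # xmm7 = (a+b, c+d, a+b, c+d)
haddps %xmm7,%xmm7       # every element of xmm7 = a+b+c+d
```

After the second haddps every element of xmm7 holds the total, so the low element can be read with movss. If the goal is a rectangle-rule estimate, that total would still need to be multiplied by the step width (0.3125 for the sample points in the question).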
Your code looks inefficient. If you're going to fully unroll instead of using a loop and an accumulator, you can compute each copy in a different register in the first place, instead of having a movups %xmm0,%xmmX at the end of every block.
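For comparison, the loop-and-accumulator form might look roughly like the sketch below. It assumes 32-bit-style addressing like the question's code, relies on n1 through n4 being laid out back to back in .data as they are in the question, and uses a hypothetical label named next; the register choices are illustrative only:

```asm
movups (szostka),%xmm2   # xmm2 = (6,6,6,6)
movups (czworka),%xmm3   # xmm3 = (4,4,4,4)
xorps  %xmm7,%xmm7       # xmm7 = accumulator, starts at (0,0,0,0)
mov    $n1,%esi          # esi = address of the first vector
mov    $4,%ecx           # four vectors: n1, n2, n3, n4 are contiguous
next:
movups (%esi),%xmm0      # xmm0 = x
movaps %xmm0,%xmm1       # xmm1 = x (copy taken before squaring)
mulps  %xmm0,%xmm0       # xmm0 = x^2
addps  %xmm2,%xmm0       # xmm0 = x^2 + 6
mulps  %xmm3,%xmm1       # xmm1 = 4x
subps  %xmm1,%xmm0       # xmm0 = x^2 - 4x + 6
addps  %xmm0,%xmm7       # add this vector of results to the accumulator
add    $16,%esi          # advance to the next 4 floats
dec    %ecx
jnz    next              # repeat until all four vectors are processed
```

With this shape the vertical adds are already done when the loop exits, leaving a single vector in xmm7 to reduce horizontally.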
Also, keep (szostka) and (czworka) in registers across iterations. Don't reload them every time. Similarly, replace movups (n1),%xmm1 with movups %xmm0, %xmm1 (before you square %xmm0). On IvyBridge and later, the register-renaming stage handles reg-reg moves, and they happen with zero latency.
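One way the first two blocks could look with those changes applied. This is a sketch rather than a definitive rewrite: holding the constants in xmm2 and xmm3 is an assumption, and movaps is used for the register-to-register copies (movups would work as well):

```asm
movups (szostka),%xmm2   # xmm2 = (6,6,6,6), loaded once
movups (czworka),%xmm3   # xmm3 = (4,4,4,4), loaded once

movups (n1),%xmm7        # xmm7 = x
movaps %xmm7,%xmm1       # xmm1 = x, copied before squaring
mulps  %xmm7,%xmm7       # xmm7 = x^2
addps  %xmm2,%xmm7       # xmm7 = x^2 + 6
mulps  %xmm3,%xmm1       # xmm1 = 4x
subps  %xmm1,%xmm7       # xmm7 = x^2 - 4x + 6

movups (n2),%xmm6        # same steps for n2, result built directly in xmm6
movaps %xmm6,%xmm1
mulps  %xmm6,%xmm6
addps  %xmm2,%xmm6
mulps  %xmm3,%xmm1
subps  %xmm1,%xmm6
```

The blocks for n3 and n4 would follow the same pattern with xmm5 and xmm4 as destinations, so no trailing copy into xmm7..xmm4 is needed.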
If you did need to load (szostka) every time, it would be better to use addps with a memory operand, instead of a separate move and add. Micro-fusion could keep that operation as a single uop.
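As a rough illustration of the memory-operand form: unlike movups, a legacy-SSE addps with a memory operand requires the address to be 16-byte aligned, so this sketch adds a .balign directive that the question's data section does not have:

```asm
.data
.balign 16               # addps from memory faults on unaligned addresses
szostka: .float 6,6,6,6

.text
# Instead of:  movups (szostka),%xmm2
#              addps  %xmm2,%xmm0
addps (szostka),%xmm0    # load and add in a single instruction
```

This also frees the register that the separate load would have occupied.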
Check out http://agner.org/optimize/ for docs on how to optimize assembly. You might find it more useful to use intrinsics and let the compiler take care of small details like register allocation, instead of writing asm directly.