Sum all elements in a quadword vector in ARM assembly with NEON

Views: 1204 Posted: 2016/5/29 14:48:33 Tags: math assembly arm neon

Question

I'm rather new to assembly, and although the ARM Information Center is often helpful, the instruction descriptions can be a little confusing to a newcomer. Basically, I need to sum the four float values in a quadword register and store the result in a single-precision register. I think the VPADD instruction can do what I need, but I'm not quite sure.

Recommended answer

It seems that you want the sum of an array of some length, not just of four float values.

In that case, your code will work, but it is far from optimized:

many pipeline interlocks

an unnecessary 32-bit addition per iteration

Assuming the length of the array is a multiple of 8 and at least 16:

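  vldmia pSrc!, {q0-q1}        @ first 8 floats become the initial partial sums
  sub count, count, #8
loop:
  pld [pSrc, #32]              @ prefetch ahead of the next load
  vldmia pSrc!, {q2-q3}        @ next 8 floats (q2-q3 are caller-saved, q4 is not)
  subs count, count, #8
  vadd.f32 q0, q0, q2
  vadd.f32 q1, q1, q3
  bgt loop

  vadd.f32 q0, q0, q1          @ 8 partial sums -> 4
  vpadd.f32 d0, d0, d1         @ 4 -> 2 (pairwise add)
  vadd.f32 s0, s0, s1          @ 2 -> 1, result in s0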
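
For readers who want to check the result on a machine without NEON, the routine can be modelled in plain C. This is only a sketch: the function name `sum_f32_model` and the `acc` array are hypothetical and stand in for the eight float lanes of q0 and q1. It is not NEON code and makes no claim about performance.

```c
#include <assert.h>
#include <stddef.h>

/* Portable model of the NEON loop (hypothetical helper, not part of the
   original answer). acc[0..3] plays the role of q0, acc[4..7] of q1. */
float sum_f32_model(const float *src, size_t count)
{
    /* Same precondition as the assembly: a multiple of 8, at least 16. */
    assert(count >= 16 && count % 8 == 0);

    float acc[8];
    for (size_t lane = 0; lane < 8; ++lane)
        acc[lane] = src[lane];              /* initial vldmia into q0-q1 */

    for (size_t i = 8; i < count; i += 8)   /* one pass per loop iteration */
        for (size_t lane = 0; lane < 8; ++lane)
            acc[lane] += src[i + lane];     /* the two lane-wise vadd.f32 */

    /* vadd.f32 q0, q0, q1: fold the upper four lanes into the lower four. */
    for (size_t lane = 0; lane < 4; ++lane)
        acc[lane] += acc[lane + 4];

    /* vpadd.f32 d0, d0, d1: add adjacent pairs, leaving two values. */
    float pair_lo = acc[0] + acc[1];
    float pair_hi = acc[2] + acc[3];

    /* vadd.f32 s0, s0, s1: final scalar add. */
    return pair_lo + pair_hi;
}
```

Because the eight lanes are accumulated independently and combined only at the end, the additions happen in a different order than in a simple sequential loop, so the result can differ from a sequential sum in the last bits for general float data.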

PLD, although an ARM instruction rather than a NEON one, is crucial for performance. It drastically increases the cache hit rate.

I hope the rest of the code above is self-explanatory.

You will notice that this version is many times faster than your initial one.
