How to use the multiply and accumulate intrinsics on the ARM Cortex-A8?
Question
How do I use the Multiply-Accumulate intrinsics provided by GCC?
float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);
Can anyone explain what three parameters I have to pass to this function? I mean, which are the source and destination registers, and what does the function return?
Help!
Answer
Simply put, the vmla instruction does the following:
/* illustration of the NEON vector type: four packed floats */
typedef struct
{
    float val[4];
} float32x4_t;

float32x4_t vmla (float32x4_t a, float32x4_t b, float32x4_t c)
{
    float32x4_t result;
    for (int i = 0; i < 4; i++)
    {
        result.val[i] = b.val[i] * c.val[i] + a.val[i];
    }
    return result;
}
And all this compiles into a single assembler instruction :-)
You can use this NEON intrinsic, among other things, in a typical 4x4 matrix multiplication for 3D graphics, like this:
float32x4_t transform (const float32x4_t * matrix, float32x4_t vector)
{
    /* matrix[0..3] hold the four columns of the matrix; each step
       broadcasts one lane of the vector, multiplies it with a column
       and accumulates into the result. In a perfect world this code
       compiles into just four instructions. */
    float32x4_t result;
    result = vmulq_lane_f32 (matrix[0], vget_low_f32 (vector), 0);
    result = vmlaq_lane_f32 (result, matrix[1], vget_low_f32 (vector), 1);
    result = vmlaq_lane_f32 (result, matrix[2], vget_high_f32 (vector), 0);
    result = vmlaq_lane_f32 (result, matrix[3], vget_high_f32 (vector), 1);
    return result;
}
This saves a couple of cycles because you don't have to add the results after the multiplication. The addition after a multiply is so common that multiply-accumulate has become mainstream these days (even x86 has added fused multiply-add in its recent instruction-set extensions).
Also worth mentioning: multiply-accumulate operations like this are very common in linear algebra and DSP (digital signal processing) applications. ARM was very smart and implemented a fast path inside the Cortex-A8 NEON core. This fast path kicks in if the first argument (the accumulator) of a VMLA instruction is the result of a preceding VMUL or VMLA instruction. I could go into detail, but in a nutshell such an instruction series runs four times faster than a VMUL / VADD / VMUL / VADD series.
Take a look at my simple matrix multiply: I did exactly that. Due to this fast path it will run roughly four times faster than an implementation written using VMUL and VADD instead of VMLA.