How to use the multiply and accumulate intrinsics in ARM Cortex-A8?
Question
How do I use the Multiply-Accumulate intrinsics provided by GCC?
float32x4_t vmlaq_f32 (float32x4_t , float32x4_t , float32x4_t);
Can anyone explain which three parameters I have to pass to this function? I mean the source and destination registers, and what the function returns.
Help!!!
Recommended answer
Simply put, the vmla instruction does the following:
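/* Conceptual model only: the real float32x4_t comes from <arm_neon.h> */
typedef struct
{
float val[4];
} float32x4_t;
/* a is the accumulator; b and c are multiplied lane by lane */
float32x4_t vmla (float32x4_t a, float32x4_t b, float32x4_t c)
{
float32x4_t result;
for (int i=0; i<4; i++)
{
result.val[i] = b.val[i]*c.val[i]+a.val[i];
}
return result;
}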
And all this compiles into a single assembler instruction :-)
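To answer the question about the three arguments directly: with the actual intrinsic from `<arm_neon.h>`, the first argument is the accumulator, the second and third are the two factors, and the return value is the new accumulator value (the arguments themselves are not modified). Below is a minimal sketch. The helper name `mla4` and the scalar fallback for builds without NEON are my own additions for illustration, not part of the original answer.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out[i] = acc[i] + x[i] * y[i] for i = 0..3. */
void mla4(float out[4], const float acc[4], const float x[4], const float y[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t a = vld1q_f32(acc);      /* accumulator (first argument) */
    float32x4_t b = vld1q_f32(x);        /* first factor                 */
    float32x4_t c = vld1q_f32(y);        /* second factor                */
    float32x4_t r = vmlaq_f32(a, b, c);  /* r = a + b * c, lane by lane  */
    vst1q_f32(out, r);
#else
    /* Scalar fallback so the sketch also builds without NEON. */
    for (int i = 0; i < 4; i++)
        out[i] = acc[i] + x[i] * y[i];
#endif
}
```

With acc = {1, 2, 3, 4}, x = {2, 2, 2, 2} and y = {10, 20, 30, 40}, the result is {21, 42, 63, 84}. On a Cortex-A8, GCC typically needs options along the lines of `-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp` for the NEON path to be compiled in; the exact flags depend on your toolchain.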
You can use this NEON intrinsic, among other things, in typical 4x4 matrix multiplications for 3D graphics, like this:
float32x4_t transform (float32x4_t * matrix, float32x4_t vector)
{
/* in a perfect world this code would compile into just four instructions */
/* pseudocode: vmul is the plain lane-wise multiply counterpart of vmla; */
/* a real transform scales each column matrix[j] by the single component vector[j] */
float32x4_t result;
result = vmul (matrix[0], vector);
result = vmla (result, matrix[1], vector);
result = vmla (result, matrix[2], vector);
result = vmla (result, matrix[3], vector);
return result;
}
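For reference, here is one way the same idea might look with the real intrinsics. This is a sketch under assumptions that are mine, not the original answer's: the matrix is stored column-major as 16 floats, the helper is called `transform4`, and each column is scaled by one lane of the vector using `vmulq_lane_f32` / `vmlaq_lane_f32`. A scalar fallback is included so it also builds without NEON.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: out = M * v, with M column-major (m[4 * col + row]). */
void transform4(float out[4], const float m[16], const float v[4])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t vec = vld1q_f32(v);
    float32x2_t lo = vget_low_f32(vec);   /* v[0], v[1] */
    float32x2_t hi = vget_high_f32(vec);  /* v[2], v[3] */
    float32x4_t r;
    r = vmulq_lane_f32(vld1q_f32(m + 0), lo, 0);       /* column 0 * v[0]    */
    r = vmlaq_lane_f32(r, vld1q_f32(m + 4), lo, 1);    /* += column 1 * v[1] */
    r = vmlaq_lane_f32(r, vld1q_f32(m + 8), hi, 0);    /* += column 2 * v[2] */
    r = vmlaq_lane_f32(r, vld1q_f32(m + 12), hi, 1);   /* += column 3 * v[3] */
    vst1q_f32(out, r);
#else
    /* Scalar fallback: same sum, one row at a time. */
    for (int row = 0; row < 4; row++) {
        float acc = m[row] * v[0];
        for (int col = 1; col < 4; col++)
            acc += m[4 * col + row] * v[col];
        out[row] = acc;
    }
#endif
}
```

Note that every multiply-accumulate takes the previous result as its accumulator, so the arithmetic is one multiply followed by three chained multiply-accumulates.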
This saves a couple of cycles because you don't have to add the results after the multiplication. The addition is used so often that multiply-accumulate instructions have become mainstream these days (even x86 has added them in its recent FMA instruction-set extensions).
Also worth mentioning: multiply-accumulate operations like this are very common in linear algebra and DSP (digital signal processing) applications. ARM was very smart and implemented a fast path inside the Cortex-A8 NEON core. This fast path kicks in if the first argument (the accumulator) of a VMLA instruction is the result of a preceding VMUL or VMLA instruction. I could go into detail, but in a nutshell such an instruction series runs four times faster than a VMUL / VADD / VMUL / VADD series.
Take a look at my simple matrix-multiply: I did exactly that. Due to this fast path it will run roughly four times faster than an implementation written using VMUL and VADD instead of VMLA.
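The same accumulate-into-the-previous-result pattern shows up in DSP-style loops such as a dot product. Here is a minimal sketch, assuming the length `n` is a multiple of four; the name `dot_f32` and the scalar fallback are hypothetical additions of mine. Whether the compiler really emits back-to-back VMLA instructions on one accumulator register is something to verify in the generated assembly, not something this code guarantees.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum of x[i] * y[i]; n is assumed to be a multiple of 4. */
float dot_f32(const float *x, const float *y, int n)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (int i = 0; i < n; i += 4) {
        /* Each step accumulates into the result of the previous step. */
        acc = vmlaq_f32(acc, vld1q_f32(x + i), vld1q_f32(y + i));
    }
    /* Horizontal sum of the four lanes. */
    float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    s = vpadd_f32(s, s);
    return vget_lane_f32(s, 0);
#else
    /* Scalar fallback that mirrors the four independent lanes. */
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int i = 0; i < n; i += 4)
        for (int k = 0; k < 4; k++)
            acc[k] += x[i + k] * y[i + k];
    return (acc[0] + acc[2]) + (acc[1] + acc[3]);
#endif
}
```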