iOS 4 Accelerate Cblas with 4x4 matrices


Question

I've been looking into the Accelerate framework that was made available in iOS 4. Specifically, I tried using the Cblas routines in my C linear algebra library. So far I can't get these functions to give me any performance gain over very basic routines, specifically in the case of 4x4 matrix multiplication. Wherever I couldn't make use of affine or homogeneous properties of the matrices, I've been using this routine (abridged):

float *mat4SetMat4Mult(const float *m0, const float *m1, float *target) {
    target[0] = m0[0] * m1[0] + m0[4] * m1[1] + m0[8] * m1[2] + m0[12] * m1[3];
    target[1] = ...etc...
    ...
    target[15] = m0[3] * m1[12] + m0[7] * m1[13] + m0[11] * m1[14] + m0[15] * m1[15];
    return target;
}

The equivalent function call for Cblas is:

cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
   4, 4, 4, 1.f, m0, 4, m1, 4, 0.f, target, 4);

To compare the two, I ran each over a large number of pre-computed matrices filled with random numbers (each function gets exactly the same input every time). Timed with the C clock() function, the Cblas routine is about 4x slower.
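A comparison like the one described could be set up roughly as follows. This is only a sketch and not the asker's actual benchmark: the names mat4_mul_fn, mat4_mul_loops and time_mat4_mul are hypothetical, and the loop-based multiply stands in for whichever routine is being measured.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical signature shared by the routines being compared. */
typedef float *(*mat4_mul_fn)(const float *m0, const float *m1, float *target);

/* Stand-in routine: plain column-major 4x4 product, target = m0 * m1. */
static float *mat4_mul_loops(const float *m0, const float *m1, float *target) {
    for (size_t col = 0; col < 4; ++col) {
        for (size_t row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (size_t k = 0; k < 4; ++k) {
                sum += m0[k * 4 + row] * m1[col * 4 + k];
            }
            target[col * 4 + row] = sum;
        }
    }
    return target;
}

/* Runs fn over count pairs of 4x4 matrices stored back to back
   (16 floats each) and returns the elapsed CPU time in seconds.
   One element of every product is accumulated into *checksum so the
   compiler cannot discard the multiplications as dead code. */
static double time_mat4_mul(mat4_mul_fn fn, const float *lhs, const float *rhs,
                            size_t count, float *checksum) {
    float result[16];
    float sum = 0.0f;
    clock_t start = clock();
    for (size_t i = 0; i < count; ++i) {
        fn(lhs + 16 * i, rhs + 16 * i, result);
        sum += result[0];
    }
    clock_t end = clock();
    *checksum = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

To time the Cblas path with the same harness, the cblas_sgemm call shown above would be wrapped in a function with this signature. Note that clock() measures CPU time at a fairly coarse resolution, so the matrix count needs to be large enough for the difference to be meaningful.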

This doesn't seem right to me, and I'm left with the feeling that I'm doing something wrong somewhere. Do I have to enable the device's NEON unit and SIMD functionality somehow? Or shouldn't I hope for better performance with such small matrices?

Many thanks,

Bastiaan

Recommended answer

The BLAS and LAPACK libraries are designed for use with what I would consider "medium to large matrices" (from tens to tens of thousands on a side). They will deliver correct results for smaller matrices, but the performance will not be as good as it could be.

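There are a couple of reasons for this:

  • To deliver top performance, 3x3 and 4x4 matrix operations must be inlined rather than live in a library; when there is so little work to be done, the overhead of making a function call is too large to overcome.

  • Delivering top performance requires an entirely different set of interfaces. The BLAS interface for matrix multiplication takes variables that specify the sizes and leading dimensions of the matrices involved in the computation, not to mention whether or not to transpose the matrices and the storage layout. All of these parameters make the library powerful, and they do not hurt performance for large matrices. However, by the time the library has finished determining that you are doing a 4x4 computation, a function dedicated to 4x4 matrix operations would already be done.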
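As an illustration of both points, a fixed-size routine can hard-code everything that the BLAS interface takes as run-time arguments. The sketch below is not from the original answer: mat4_mul is a hypothetical name, and it assumes the same column-major layout as the question's mat4SetMat4Mult.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size helper: column-major 4x4 product, target = m0 * m1.
   The size, leading dimension, layout and "no transpose" choice are all
   compile-time constants here, so there is nothing to decode at run time,
   and static inline lets the compiler unroll the loops and remove the call. */
static inline float *mat4_mul(const float *m0, const float *m1, float *target) {
    /* The output must not alias an input: inputs are still being read
       after the first elements of target have been written. */
    assert(target != m0 && target != m1);
    for (size_t col = 0; col < 4; ++col) {
        for (size_t row = 0; row < 4; ++row) {
            float sum = 0.0f;
            for (size_t k = 0; k < 4; ++k) {
                /* element (row, k) of m0 times element (k, col) of m1 */
                sum += m0[k * 4 + row] * m1[col * 4 + k];
            }
            target[col * 4 + row] = sum;
        }
    }
    return target;
}
```

Whether a compiler turns this into SIMD code depends on the toolchain and optimization flags. The point is only that a specialized routine has no parameters to interpret before the arithmetic starts.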

What this means for you: if you would like to have dedicated small matrix operations provided, please go to bugreport.apple.com and file a bug requesting this feature.
