vector * matrix product efficiency issue


Problem description

Just as Z boson recommended, I am using a column-major matrix format in order to avoid having to use the dot product. I don't see a feasible way to avoid it when multiplying a vector with a matrix, though. The matrix multiplication trick requires efficient extraction of rows (or columns, if we transpose the product). To multiply a vector by a matrix, we therefore transpose:

(b * A)^T = A^T * b^T

A is a matrix and b a row vector which, after being transposed, becomes a column vector. Its rows are then just single scalars, so the vector * matrix product implementation becomes an inefficient implementation of dot products between the columns of the (non-transposed) matrix A and b. Is there a way to avoid performing these dot products? The only way I can see would involve row extraction, which is inefficient with the column-major matrix format.

Solution

This can be understood from my original post on this (my first on SO): efficient-4x4-matrix-vector-multiplication-with-sse-horizontal-add-and-dot-prod. The rest of the discussion applies to 4x4 matrices.

Here are two methods to do matrix times vector (v = Mu, where v and u are column vectors):

method 1) v1 = dot(row1, u), v2 = dot(row2, u), v3 = dot(row3, u), v4 = dot(row4, u)
method 2) v = u1*col1 + u2*col2 + u3*col3 + u4*col4.

The first method is more familiar from math class, while the second is more efficient on a SIMD computer. The second method uses vectorized math (as in numpy), e.g.

u1*col1 = (u1x*col1x, u1y*col1y, u1z*col1z, u1w*col1w).
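Method 2 maps directly onto SSE: broadcast each component of u and accumulate the scaled columns with vertical operations. A minimal sketch, assuming an x86 target with SSE and a 4x4 matrix in a plain column-major float array (the function name is made up for the example):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* v = M*u for a 4x4 matrix M stored column-major (M[0..3] is the first
 * column).  Method 2: broadcast each component of u and accumulate the
 * scaled columns with vertical (element-wise) SIMD operations -- no dot
 * products, shuffles, or horizontal adds needed. */
void mat_vec_colmajor(const float M[16], const float u[4], float v[4])
{
    __m128 acc = _mm_mul_ps(_mm_set1_ps(u[0]), _mm_loadu_ps(M + 0));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(u[1]), _mm_loadu_ps(M + 4)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(u[2]), _mm_loadu_ps(M + 8)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(u[3]), _mm_loadu_ps(M + 12)));
    _mm_storeu_ps(v, acc);
}
```

Four broadcasts, four multiplies, and three adds; every instruction works on all four lanes in parallel.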

Now let's look at vector times matrix (v = uM where v and u are row vectors)

method 1) v1 = dot(col1, u), v2 = dot(col2, u), v3 = dot(col3, u), v4 = dot(col4, u)
method 2) v = u1*row1 + u2*row2 + u3*row3 + u4*row4.

Now the roles of columns and rows have swapped but method 2 is still the efficient method to use on a SIMD computer.

To do matrix times vector efficiently on a SIMD computer the matrix should be stored in column-major order. To do vector times matrix efficiently on a SIMD computer the matrix should be stored in row-major order.
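The mirror case follows the same pattern. A sketch under the same assumptions (x86 with SSE, hypothetical function name), this time with a row-major 4x4 matrix so that each row is contiguous:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* v = u*M for a 4x4 matrix M stored row-major (M[0..3] is the first row).
 * Method 2 again: broadcast each component of u and accumulate the scaled
 * rows with vertical SIMD operations -- no dot products needed. */
void vec_mat_rowmajor(const float u[4], const float M[16], float v[4])
{
    __m128 acc = _mm_mul_ps(_mm_set1_ps(u[0]), _mm_loadu_ps(M + 0));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(u[1]), _mm_loadu_ps(M + 4)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(u[2]), _mm_loadu_ps(M + 8)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(u[3]), _mm_loadu_ps(M + 12)));
    _mm_storeu_ps(v, acc);
}
```

Note that the kernel is identical to the column-major matrix-times-vector one; only the interpretation of the 16 floats changes, which is exactly why each multiplication convention pairs naturally with one storage order.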

As far as I understand, OpenGL uses column-major ordering and does matrix times vector, while DirectX uses row-major ordering and does vector times matrix. If you have three matrix transformations that you apply in order, M1 first, then M2, then M3, then with matrix times vector you write it as

v = M3*M2*M1*u //u and v are column vectors - OpenGL form

With vector times matrix you write

v = u*M1*M2*M3 //u and v are row vectors - DirectX form

Neither form is better than the other in terms of efficiency. It's just a question of notation (and causing confusion which is useful when you have competition).
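The equivalence of the two notations is easy to check numerically with plain scalar code (an illustrative sketch; the helper names are made up, and the matrix is stored row-major here purely for readability):

```c
/* Column-vector convention: v = A*u, with A stored row-major. */
void mat_times_vec(const float A[16], const float u[4], float v[4])
{
    for (int i = 0; i < 4; i++) {
        v[i] = 0.0f;
        for (int j = 0; j < 4; j++)
            v[i] += A[4*i + j] * u[j];
    }
}

/* Row-vector convention: v = u*A, with A stored row-major. */
void vec_times_mat(const float u[4], const float A[16], float v[4])
{
    for (int j = 0; j < 4; j++) {
        v[j] = 0.0f;
        for (int i = 0; i < 4; i++)
            v[j] += u[i] * A[4*i + j];
    }
}

/* T = A^T */
void transpose4(const float A[16], float T[16])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            T[4*j + i] = A[4*i + j];
}
```

For any A and u, `vec_times_mat(u, A, v)` gives the same result as `mat_times_vec` applied to A^T, which is exactly the identity (b*A)^T = A^T * b^T from the question: the two conventions compute the same numbers and differ only in notation and in which storage order makes method 2 possible.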

It's important to note that for matrix*matrix row-major versus column-major storage is irrelevant.

If you want to know why the vertical SIMD instructions are faster than the horizontal ones, that's a separate question which should be asked, but in short, the horizontal ones really act in serial rather than in parallel and are broken up into several micro-ops (which is why, ironically, dppd is faster than dpps).
