matlab/octave - Generalized matrix multiplication


Problem description

I would like to write a function that generalizes matrix multiplication. Basically, it should be able to do standard matrix multiplication, but it should allow changing the two binary operators (product/sum) to any other functions.

The goal is to be as efficient as possible, both in terms of CPU and memory. Of course, it will always be less efficient than A*B, but the operators' flexibility is the point here.
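To make "changing the operators" concrete, here is a small illustrative sketch (the min-plus or "tropical" product is just one hypothetical choice of operators, not something from the question): replacing the product with + and the sum with min turns matrix multiplication into a shortest-path step on a weighted graph.

```matlab
% Standard product:    C(i,j) = sum over k of A(i,k) * B(k,j)
% Generalized product: C(i,j) = REDUCE over k of OP(A(i,k), B(k,j))
% Example: min-plus ("tropical") product, with REDUCE = min and OP = +
A = [0 3; 2 0];          % edge weights of a tiny 2-node graph
B = A;
C = zeros(2);
for i = 1:2
    for j = 1:2
        C(i,j) = min(A(i,:) + B(:,j)');   % combine with +, reduce with min
    end
end
% C(i,j) is now the cheapest 2-edge path from node i to node j
```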

Here are a few commands I could come up with after reading various interesting threads:

A = randi(10, 2, 3);
B = randi(10, 3, 4);

% 1st method
C = sum(bsxfun(@mtimes, permute(A,[1 3 2]),permute(B,[3 2 1])), 3)
% Alternative: C = bsxfun(@(a,b) mtimes(a',b), A', permute(B, [1 3 2]))

% 2nd method
C = sum(bsxfun(@(a,b) a*b, permute(A,[1 3 2]),permute(B,[3 2 1])), 3)

% 3rd method (Octave-only)
C = sum(permute(A, [1 3 2]) .* permute(B, [3 2 1]), 3)

% 4th method (Octave-only): multiply nxm A with nx1xd B to create a nxmxd array
C = bsxfun(@(a, b) sum(times(a,b)), A', permute(B, [1 3 2]));
C = squeeze(C(1,:,:)); % extract the sums and turn into mxd
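As a quick sanity check (a sketch; methods 3-4 above are Octave-only, but method 1 also runs in MATLAB), the permute/bsxfun construction can be verified against the built-in product:

```matlab
A = randi(10, 2, 3);
B = randi(10, 3, 4);
% permute(A,[1 3 2]) is 2x1x3 and permute(B,[3 2 1]) is 1x4x3, so slice k
% of the bsxfun result holds the outer product A(:,k)*B(k,:);
% summing over the 3rd dimension collapses them into the usual product
C1 = sum(bsxfun(@mtimes, permute(A,[1 3 2]), permute(B,[3 2 1])), 3);
assert(isequal(C1, A*B));   % matches the standard multiplication
```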

The problem with methods 1-3 is that they generate n matrices before collapsing them with sum(). Method 4 is better because it performs the sum() inside the bsxfun, but bsxfun still generates n matrices (except that they are mostly empty, containing only a vector of nonzero values, the sums, with the rest padded with zeros to match the dimension requirement).

What I would like is something like the 4th method, but without the useless zeros, to save memory.

Any ideas?

Answer

Here is a slightly more polished version of the solution you posted, with some small improvements.

We check whether we have more rows than columns or the other way around, and then do the multiplication accordingly, choosing either to multiply rows with matrices or matrices with columns (thus doing the fewest loop iterations).

Note: This may not always be the best strategy (going by rows instead of by columns), even if there are fewer rows than columns; the fact that MATLAB arrays are stored in column-major order in memory makes it more efficient to slice by columns, since the elements are stored consecutively. Accessing rows, on the other hand, involves traversing elements by strides, which is not cache-friendly -- think spatial locality.
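The column-major point is easy to demonstrate; this micro-benchmark (a sketch, absolute timings will vary by machine) sums the same matrix first by column slices, then by row slices:

```matlab
X = rand(2000);            % square matrix, stored column-major
tic
s1 = 0;
for j = 1:size(X,2), s1 = s1 + sum(X(:,j)); end   % contiguous column reads
tCol = toc;
tic
s2 = 0;
for i = 1:size(X,1), s2 = s2 + sum(X(i,:)); end   % strided row reads
tRow = toc;
fprintf('by columns: %.4f s, by rows: %.4f s\n', tCol, tRow);
```

On typical hardware the row loop is noticeably slower, even though both loops compute the same total.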

Other than that, the code should handle double/single, real/complex, and full/sparse inputs (raising an error for combinations that are not possible). It also respects empty matrices and zero dimensions.

function C = my_mtimes(A, B, outFcn, inFcn)
    % default arguments
    if nargin < 4, inFcn = @times; end
    if nargin < 3, outFcn = @sum; end

    % check valid input
    assert(ismatrix(A) && ismatrix(B), 'Inputs must be 2D matrices.');
    assert(isequal(size(A,2),size(B,1)),'Inner matrix dimensions must agree.');
    assert(isa(inFcn,'function_handle') && isa(outFcn,'function_handle'), ...
        'Expecting function handles.')

    % preallocate output matrix
    M = size(A,1);
    N = size(B,2);
    if issparse(A)
        args = {'like',A};
    elseif issparse(B)
        args = {'like',B};
    else
        args = {superiorfloat(A,B)};
    end
    C = zeros(M,N, args{:});

    % compute matrix multiplication
    % http://en.wikipedia.org/wiki/Matrix_multiplication#Inner_product
    if M < N
        % concatenation of products of row vectors with matrices
        % A*B = [a_1*B ; a_2*B ; ... ; a_m*B]
        for m=1:M
            %C(m,:) = A(m,:) * B;
            %C(m,:) = sum(bsxfun(@times, A(m,:)', B), 1);
            C(m,:) = outFcn(bsxfun(inFcn, A(m,:)', B), 1);
        end
    else
        % concatenation of products of matrices with column vectors
        % A*B = [A*b_1 , A*b_2 , ... , A*b_n]
        for n=1:N
            %C(:,n) = A * B(:,n);
            %C(:,n) = sum(bsxfun(@times, A, B(:,n)'), 2);
            C(:,n) = outFcn(bsxfun(inFcn, A, B(:,n)'), 2);
        end
    end
end
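For instance (a usage sketch that assumes the my_mtimes function above is on the path; the minFcn name is just illustrative), a min-plus "tropical" product only requires handles with the expected signatures: outFcn(X, dim) must reduce along dimension dim, which is why min needs a small anonymous wrapper to adapt its min(X, [], dim) calling convention:

```matlab
A = magic(4);
B = magic(4);

% default operators reproduce the ordinary product
assert(isequal(my_mtimes(A, B), A * B));

% min-plus product: sum -> min, times -> plus
minFcn = @(X, dim) min(X, [], dim);   % adapt min to the outFcn(X, dim) signature
C = my_mtimes(A, B, minFcn, @plus);   % C(i,j) = min over k of A(i,k) + B(k,j)
```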


Comparison

The function is no doubt slower throughout, but for larger sizes it is orders of magnitude worse than the built-in matrix-multiplication:

        (tic/toc times in seconds)
      (tested in R2014a on Windows 8)

    size      mtimes       my_mtimes 
    ____    __________     _________
     400     0.0026398       0.20282
     600      0.012039       0.68471
     800      0.014571        1.6922
    1000      0.026645        3.5107
    2000       0.20204         28.76
    4000        1.5578        221.51

Here is the test code:

sz = [10:10:100 200:200:1000 2000 4000];
t = zeros(numel(sz),2);
for i=1:numel(sz)
    n = sz(i); disp(n)
    A = rand(n,n);
    B = rand(n,n);

    tic
    C = A*B;
    t(i,1) = toc;
    tic
    D = my_mtimes(A,B);
    t(i,2) = toc;

    assert(norm(C-D) < 1e-6)
    clear A B C D
end

semilogy(sz, t*1000, '.-')
legend({'mtimes','my_mtimes'}, 'Interpreter','none', 'Location','NorthWest')
xlabel('Size N'), ylabel('Time [msec]'), title('Matrix Multiplication')
axis tight



Extra

For completeness, below are two more naive ways to implement the generalized matrix multiplication (if you want to compare the performance, replace the last part of the my_mtimes function with either of these). I'm not even gonna bother posting their elapsed times :)

C = zeros(M,N, args{:});
for m=1:M
    for n=1:N
        %C(m,n) = A(m,:) * B(:,n);
        %C(m,n) = sum(bsxfun(@times, A(m,:)', B(:,n)));
        C(m,n) = outFcn(bsxfun(inFcn, A(m,:)', B(:,n)));
    end
end

And another way (with a triple-loop):

C = zeros(M,N, args{:});
P = size(A,2); % = size(B,1);
for m=1:M
    for n=1:N
        for p=1:P
            %C(m,n) = C(m,n) + A(m,p)*B(p,n);
            %C(m,n) = plus(C(m,n), times(A(m,p),B(p,n)));
            C(m,n) = outFcn([C(m,n) inFcn(A(m,p),B(p,n))]);
        end
    end
end




What to try next?

If you want to squeeze out more performance, you're gonna have to move to a C/C++ MEX-file to cut down on the overhead of interpreted MATLAB code. You can still take advantage of optimized BLAS/LAPACK routines by calling them from MEX-files (see the second part of this post for an example). MATLAB ships with the Intel MKL library, which frankly you cannot beat when it comes to linear algebra computations on Intel processors.

Others have already mentioned a couple of submissions on the File Exchange that implement general-purpose matrix routines as MEX-files (see @natan's answer). Those are especially effective if you link them against an optimized BLAS library.
