When is 'crossprod' preferred to '%*%', and when isn't?


Question

When exactly is crossprod(X,Y) preferred to t(X) %*% Y when X and Y are both matrices? The documentation says

Given matrices x and y as arguments, return a matrix cross-product. This is formally equivalent to (but usually slightly faster than) the call t(x) %*% y (crossprod) or x %*% t(y) (tcrossprod).

So when is it not faster? When searching online I found several sources that stated either that crossprod is generally preferred and should be used as default (e.g. here), or that it depends and results of comparisons are inconclusive. For example, Douglas Bates (2007) says that:

Note that in newer versions of R and the BLAS library (as of summer 2007), R’s %*% is able to detect the many zeros in mm and shortcut many operations, and is hence much faster for such a sparse matrix than crossprod which currently does not make use of such optimizations [...]

So when should I use %*% and when crossprod?

Answer

I have briefly covered this issue in several of my answers on statistical computing questions. The follow-up part of this answer of mine is relatively comprehensive. Note that my discussion there, as well as the following, assumes that you know what BLAS is, and in particular what the level-3 BLAS routines dgemm and dsyrk are.

My answer here will provide some evidence and benchmarking to back up my argument over there. Additionally, an explanation of Douglas Bates' comment will be given.

Let's check the source code of both routines. The C-level source code of the R base package is largely found under

R-release/src/main

In particular, matrix operations are implemented in

R-release/src/main/array.c

Now

  • "%*%" matches the C routine matprod, which calls dgemm with transa = "N" and transb = "N";
  • crossprod is readily seen to be a composite function, matching symcrossprod, crossprod, symtcrossprod and tcrossprod (counterparts for complex matrices, like ccrossprod, are not listed here).

You should now understand that crossprod avoids all explicit matrix transposition. crossprod(A, B) is cheaper than t(A) %*% B, as the latter needs an explicit matrix transposition for A. In the following, I will refer to this as transposition overhead.
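The formal equivalence itself is easy to check directly (a minimal sketch; the small sizes are arbitrary):

```r
set.seed(1)
# A and B must have the same number of rows for crossprod,
# and the same number of columns for tcrossprod; 4 x 3 works for both
A <- matrix(runif(12), nrow = 4)
B <- matrix(runif(12), nrow = 4)

# crossprod(A, B) computes t(A) %*% B without materializing t(A)
stopifnot(all.equal(crossprod(A, B), t(A) %*% B))
stopifnot(all.equal(tcrossprod(A, B), A %*% t(B)))
```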

R level profiling can expose this overhead. Consider the following example:

A <- matrix(runif(1000 * 1000), 1000)
B <- matrix(runif(1000 * 1000), 1000)

Rprof("brutal.out")
result <- t.default(A) %*% B
Rprof(NULL)
summaryRprof("brutal.out")

Rprof("smart.out")
result <- crossprod(A, B)
Rprof(NULL)
summaryRprof("smart.out")

Note how t.default comes into the profiling result for the first case.

Profiling also gives the CPU time of the execution. You will see that both seem to take the same amount of time, as the overhead is insignificant. Now I will show you when the overhead becomes a pain.

Ak * m矩阵,而Bk * n矩阵,则矩阵乘法A'B('表示转置)具有FLOP计数(浮点加法和乘法的量)2mnk.如果执行t(A) %*% B,则转置开销为mk(A中的元素数),因此比率

Let A be a k * m matrix and B a k * n matrix; then the matrix multiplication A'B (' means transpose) has a FLOP count (the number of floating-point additions and multiplications) of 2mnk. If you do t(A) %*% B, the transposition overhead is mk (the number of elements in A), so the ratio is

useful computation : overhead = 2n : 1

Unless n is big, the transposition overhead cannot be amortized in the actual matrix multiplication.
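Plugging in numbers makes the ratio concrete (a tiny arithmetic sketch; the dimensions here are hypothetical):

```r
k <- 1000; m <- 1000

flops    <- function(n) 2 * m * n * k  # FLOPs of the k x m by k x n product
overhead <- m * k                      # elements of A copied by the explicit t(A)

overhead / flops(1)     # n = 1: the overhead is 50% of the useful work
overhead / flops(1000)  # n = 1000: the overhead drops to 0.05%
```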

The most extreme case is n = 1, i.e., you have a single-column matrix (or a vector) for B. Consider a benchmarking:

library(microbenchmark)
A <- matrix(runif(2000 * 2000), 2000)
x <- runif(2000)
microbenchmark(t.default(A) %*% x, crossprod(A, x))

You will see that crossprod is several times faster!

As mentioned in my linked answer (as well as in Bates' notes with benchmarking result), if you do A'A, crossprod is sure to be 2 times faster.

Having sparse matrices does not change the fundamental conclusion above. The R package Matrix, for setting up sparse matrices, also has sparse computation methods for "%*%" and crossprod. So you should still expect crossprod to be slightly faster.
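A minimal sketch with the Matrix package (shipped with R as a recommended package; rsparsematrix generates a random sparse matrix):

```r
library(Matrix)
set.seed(42)

# two sparse 1000 x 1000 matrices with ~1% nonzero entries
A <- rsparsematrix(1000, 1000, density = 0.01)
S <- rsparsematrix(1000, 1000, density = 0.01)

# both operators dispatch to sparse methods and agree on the result
stopifnot(all.equal(as.matrix(crossprod(A, S)), as.matrix(t(A) %*% S)))
```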

This has nothing to do with sparse matrices. BLAS is strictly intended for dense numerical linear algebra.

What it relates to is the difference in the loop variants used inside Netlib's F77 reference dgemm. There are two loop variants for scheduling matrix-matrix multiplications op(A) * op(B) (the * here denotes matrix multiplication not elementwise product!), and the variant is completely decided by the transpose setting of op(A):

  • for op(A) = A', the "inner product" version is used in the innermost loop, in which case no zero detection is possible;
  • for op(A) = A, the "AXPY" version is used, and any zero in op(B) can be excluded from the computation.
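The two variants can be sketched in plain R (illustration only; the real dgemm is Fortran and far faster, and these function names are made up):

```r
# "inner product" variant, as used for op(A) = A' (i.e., crossprod):
# each C[i, j] is a full dot product, so zeros in B give no shortcut
crossprod_inner <- function(A, B) {
  C <- matrix(0, ncol(A), ncol(B))
  for (j in seq_len(ncol(B)))
    for (i in seq_len(ncol(A)))
      C[i, j] <- sum(A[, i] * B[, j])
  C
}

# "AXPY" variant, as used for op(A) = A (i.e., "%*%"):
# each column of C is a sum of scaled columns of A, and a zero
# multiplier B[l, j] lets the whole update be skipped
matprod_axpy <- function(A, B) {
  C <- matrix(0, nrow(A), ncol(B))
  for (j in seq_len(ncol(B)))
    for (l in seq_len(nrow(B)))
      if (B[l, j] != 0)            # the zero-detection shortcut
        C[, j] <- C[, j] + A[, l] * B[l, j]
  C
}
```

The `if (B[l, j] != 0)` test is exactly the shortcut that a sparse-ish B benefits from, and it has no counterpart in the inner-product variant.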

Now think about how R calls dgemm. The first case corresponds to crossprod, while the second case to "%*%" (as well as tcrossprod).

In this respect, if your B matrix has a lot of zeros while still being stored in dense matrix format, then t(A) %*% B will be faster than crossprod(A, B), simply because the loop variant for the former is more efficient.

The most illuminating example is when B is a diagonal matrix, or a banded matrix. Let's consider a banded matrix (actually a symmetric tri-diagonal matrix here):

B_diag <- diag(runif(1000))
B_subdiag <- rbind(0, cbind(diag(runif(999)), 0))
B <- B_diag + B_subdiag + t(B_subdiag)

Now let A still be a full dense matrix

A <- matrix(runif(1000 * 1000), 1000)

Then compare

library(microbenchmark)
microbenchmark(t.default(A) %*% B, crossprod(A, B))

You will see that "%*%" is faster!

I guess an even better example is when the matrix B is triangular, which is fairly common in practice. A triangular matrix is not regarded as a sparse matrix (and should not be stored as one, either).
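A quick sketch for the triangular case (same idea as the banded example above; the timings you see will depend on your BLAS):

```r
set.seed(0)
A <- matrix(runif(1000 * 1000), 1000)

# an upper-triangular B: stored as an ordinary dense matrix,
# but roughly half of its entries are exact zeros
B <- matrix(runif(1000 * 1000), 1000)
B[lower.tri(B)] <- 0

# with the reference BLAS, "%*%" can skip the zero half; crossprod cannot
system.time(for (i in 1:5) t.default(A) %*% B)
system.time(for (i in 1:5) crossprod(A, B))
```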

Caution: if you are using R with an optimized BLAS library, like OpenBLAS or Intel MKL, you will still see that crossprod is faster. However, this is really because the blocking and caching strategy of any optimized BLAS library destroys the loop-variant scheduling pattern of Netlib's F77 reference BLAS, so no "zero detection" is possible. As a result, what you will observe for this particular example is that the F77 reference BLAS is even faster than an optimized BLAS.
