Why is sampling matrix rows very slow?

Problem description

I was trying to do some bootstrapping and calculate colMeans; naturally, I chose a matrix to store the data. However, sampling its rows is very slow:

m[sample(n,replace=TRUE),]

It turns out that data.table is the fastest:

require(microbenchmark)
require(data.table)
n = 2000
nc = 8000
m = matrix(1:(n*nc), nrow = n)
DF = as.data.frame(m)
DT = as.data.table(m)

s=sample(n, replace=TRUE)
microbenchmark(m[s,], DF[s,],DT[s,])

# Unit: milliseconds
#     expr      min       lq     mean   median       uq      max neval
#   m[s, ] 371.9271 402.3542 421.7907 420.8446 437.8251 506.1788   100
#  DF[s, ] 182.3189 199.0865 218.0746 213.9451 231.1518 409.8625   100
#  DT[s, ] 129.8225 139.1977 156.9506 150.4321 164.3104 254.2048   100

Why is sampling the matrix so much slower than the other two?

Recommended answer

At first glance, two possibilities stand out in R's function MatrixSubset, on line 265 of subset.c. It might be neither of these. Just guessing.

The first possibility: it appears to loop in a direction that is not cache-efficient.

for (i = 0; i < nrs; i++) {    // rows
  ...
  for (j = 0; j < ncs; j++) {  // columns
    ...

Your example has a lot of columns (8,000). Each time the inner loop fetches a new column, it needs to fetch the page of RAM holding that value from RAM into cache (most likely L2). The next fetch is a different column, so it's less likely to be able to reuse a page already in L2. A matrix is internally one huge contiguous vector: all of column 1, followed by all of column 2, and so on. A page fetch is relatively expensive. Going in the "wrong" direction incurs too many page fetches. More about CPU cache here.
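As a quick illustration (a standalone C sketch, not R's code), summing the same column-major array in the two orders shows the effect; once the data outgrows cache, the row-by-row traversal is typically several times slower:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    int nr = 2000, nc = 8000;
    double *m = malloc((size_t)nr * nc * sizeof *m);
    for (size_t k = 0; k < (size_t)nr * nc; k++) m[k] = (double)k;

    clock_t t0 = clock();
    double s1 = 0;
    for (int j = 0; j < nc; j++)          // column by column: contiguous reads
        for (int i = 0; i < nr; i++)
            s1 += m[i + (size_t)j * nr];
    clock_t t1 = clock();

    double s2 = 0;
    for (int i = 0; i < nr; i++)          // row by row: strided reads,
        for (int j = 0; j < nc; j++)      // far more cache misses
            s2 += m[i + (size_t)j * nr];
    clock_t t2 = clock();

    printf("column order %.3fs, row order %.3fs (checksums %.0f %.0f)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, s1, s2);
    free(m);
}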

A good compiler should perform loop interchange automatically, such as gcc with -floop-interchange (enabled at higher optimization levels). More here. This optimization might not be happening here due to the complexity of what's inside the for loops; perhaps, in this case, the switch statements. Or perhaps the version of R you're using on your OS wasn't compiled with a compiler that has that option, or it wasn't turned on.
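Schematically, loop interchange turns the first form below into the second; both compute the same result, but the second walks the column-major array contiguously (a plain-C sketch, not R's actual code):

#include <stddef.h>

static double sum_rows_outer(const double *x, int nr, int nrs, int ncs) {
    double s = 0;
    for (int i = 0; i < nrs; i++)        // rows outer: strided access
        for (int j = 0; j < ncs; j++)
            s += x[i + (size_t)j * nr];
    return s;
}

static double sum_cols_outer(const double *x, int nr, int nrs, int ncs) {
    double s = 0;
    for (int j = 0; j < ncs; j++)        // columns outer: contiguous access
        for (int i = 0; i < nrs; i++)
            s += x[i + (size_t)j * nr];
    return s;
}

The compiler may only do this when it can prove the two orders are equivalent, which gets harder as the loop bodies grow.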

The second possibility: the switch on type is happening on each and every item in the matrix, even though a matrix is a single type! This is wasteful. Even if the switch is optimized with a jump table, that jump table is probably still being consulted for every item in the matrix ('probably' because the CPU might predict the switch). Since your example matrix is small at 61MB, I'm leaning more towards this being the culprit rather than looping in the wrong direction.
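In plain C the wasteful pattern looks roughly like this (a paraphrase for illustration, not the actual subset.c source); the type is loop-invariant, yet the dispatch runs nrs * ncs times:

#include <stddef.h>

enum sxptype { T_INT, T_REAL };  // stand-ins for R's INTSXP / REALSXP

// One dispatch per element copied, even though 'type' never changes.
static void copy_one(enum sxptype type, void *dst, const void *src, size_t k)
{
    switch (type) {
    case T_INT:  ((int *)dst)[k]    = ((const int *)src)[k];    break;
    case T_REAL: ((double *)dst)[k] = ((const double *)src)[k]; break;
    }
}

A restructured version hoists the switch outside both loops and checks bounds once up front, something like this: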

// Check the row numbers once up front rather than 8,000 times.
// This is a contiguous sweep and therefore almost instant
// Declare variables i and ii locally for safety and maximum compiler optimizations
for (int i = 0; i < nrs; i++) {
  int ii = INTEGER(sr)[i];
  if (ii != NA_INTEGER && (ii < 1 || ii > nr))
    errorcall(call, R_MSG_subs_o_b);
}

// Check the column numbers up front once rather than 2,000 times
for (int j = 0; j < ncs; j++) {
  int jj = INTEGER(sc)[j];
  if (jj != NA_INTEGER && (jj < 1 || jj > nc))
    errorcall(call, R_MSG_subs_o_b);
}

// Now switch once on type rather than 8,000 * 2,000 times
// Loop column-by-column not row-by-row

int resi=0;  // contiguous write to result (for page efficiency)
int ii, jj;  // the current row and column, bounds checked above
switch (TYPEOF(x)) {
  case LGLSXP:  // the INTSXP will work for LGLSXP too, currently
  case INTSXP:
    for (int j=0; j<ncs; j++) {  // column-by-column
      jj = INTEGER(sc)[j];
      for (int i=0; i<nrs; i++) {  // within-this-column
        ii = INTEGER(sr)[i];
        // ii and jj are 1-based R indices; convert to a 0-based offset
        INTEGER(result)[resi++] = (ii == NA_INTEGER || jj == NA_INTEGER) ? NA_INTEGER : INTEGER(x)[(ii - 1) + (jj - 1) * nr];
      }
    }
    break;
  case REALSXP:
    for (int j=0; j<ncs; j++) {
      jj = INTEGER(sc)[j];
      for (int i=0; i<nrs; i++) {
        ii = INTEGER(sr)[i];
        REAL(result)[resi++] = (ii == NA_INTEGER || jj == NA_INTEGER) ? NA_REAL : REAL(x)[(ii - 1) + (jj - 1) * nr];
      }
    }
    break;
  case ...

As you can see, there is more code this way, because the same for loops have to be repeated over and over within the switch() cases. Code readability and robustness may be why the original code is the way it is: there is less chance of a typo in R's implementation. That's already demonstrated here, because I was lazy and didn't implement the LGLSXP case specially for LOGICAL. I know LOGICAL is exactly the same as INTEGER currently in base R, but that might change in the future, so my laziness (due to code bloat) could well cause a bug in R later if LOGICAL does change (say, to char rather than int for RAM efficiency).

One possible option to solve the code-bloat problem: notice that all that's really happening is moving memory around. So all the types (other than STRSXP, VECSXP and EXPRSXP) can be done with a single double for loop using memcpy with the type's size. SET_STRING_ELT and SET_VECTOR_ELT must still be used to maintain reference counts on those objects, so that should be just three repetitions of the double for loops to maintain. Alternatively, that idiom can be wrapped into a #define, as is done in other parts of R.
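A minimal sketch of that idiom (simplified plain C; the function name and signature are illustrative, not R's actual internals): one loop body serves every non-pointer type, because only the element size differs.

#include <string.h>

// Copy the selected rows of one column; elsize is sizeof one element
// (4 for INTSXP/LGLSXP, 8 for REALSXP, 16 for CPLXSXP, and so on).
static void take_rows_generic(void *dst, const void *src,
                              const int *rows,  /* 1-based row numbers */
                              int nrs, size_t elsize)
{
    char *d = dst;
    const char *s = src;
    for (int i = 0; i < nrs; i++)
        memcpy(d + (size_t)i * elsize,
               s + (size_t)(rows[i] - 1) * elsize,
               elsize);
}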

Finally, whether there are any NAs in the rows or columns passed in (it's a very common case not to request the NA'th row or NA'th column!) can be detected in the first bounds-checking loop. If there are no NAs, then the deepest ternary ((ii == NA_INTEGER || jj == NA_INTEGER) ? :), a branch evaluated 2000 * 8000 times, can be saved by hoisting it outside the loops, at the cost of more complex repeated code. Then again, perhaps branch prediction kicks in reliably enough on all architectures that we shouldn't worry about it.
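Here's how that hoisting might look (a hedged plain-C stand-in: NA_INT mimics R's NA_INTEGER sentinel, and the function is made up for illustration):

#include <limits.h>
#include <math.h>
#include <stdbool.h>
#include <stddef.h>

#define NA_INT INT_MIN   /* R encodes NA_integer_ as INT_MIN */

static void take_rows_real(double *result, const double *x, int nr,
                           const int *sr, int nrs, const int *sc, int ncs)
{
    // One contiguous pass decides which loop variant we need.
    bool any_na = false;
    for (int i = 0; i < nrs && !any_na; i++) any_na = (sr[i] == NA_INT);
    for (int j = 0; j < ncs && !any_na; j++) any_na = (sc[j] == NA_INT);

    size_t resi = 0;
    if (!any_na) {
        // Fast path: the 2000 * 8000 per-element NA tests vanish entirely.
        for (int j = 0; j < ncs; j++)
            for (int i = 0; i < nrs; i++)
                result[resi++] = x[(sr[i] - 1) + (size_t)(sc[j] - 1) * nr];
    } else {
        // Slow path: keep the per-element test (NAN standing in for NA_REAL).
        for (int j = 0; j < ncs; j++)
            for (int i = 0; i < nrs; i++)
                result[resi++] = (sr[i] == NA_INT || sc[j] == NA_INT)
                                 ? NAN
                                 : x[(sr[i] - 1) + (size_t)(sc[j] - 1) * nr];
    }
}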

data.table does both the memcpy trick and the deep branch saving, in some but not all places. It has also started to subset in parallel, column by column, but not in this case yet, simply because that feature is new and still being rolled out (setkey is very similar and is already parallel). The master thread handles the character and list columns one by one (not in parallel), since SET_STRING_ELT and SET_VECTOR_ELT are not thread-safe in R. The other threads handle all the integer, real, complex and raw columns in parallel, and it then goes as fast as memory I/O can go.
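The parallel pattern described might be sketched like this (plain C with OpenMP, compiled with -fopenmp; the names are illustrative and this is not data.table's actual code). Numeric columns are independent, so the column loop parallelizes cleanly:

// Character and list columns would be excluded here and handled serially
// by the master thread, since SET_STRING_ELT / SET_VECTOR_ELT are not
// thread-safe.
void subset_numeric_columns(double **out, double *const *cols, int ncols,
                            const int *rows /* 1-based */, int nrs)
{
    #pragma omp parallel for schedule(dynamic)
    for (int j = 0; j < ncols; j++)
        for (int i = 0; i < nrs; i++)
            out[j][i] = cols[j][rows[i] - 1];
}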

I don't really see the difference you see at 61MB, but scaling up to a (still small) 610MB, by increasing the number of columns 10x to 80,000, I do see a difference.

n = 2000
nc = 8000    # same size as your example (61MB), on my laptop
microbenchmark(m[s,], DF[s,],DT[s,])
Unit: milliseconds
    expr       min        lq      mean    median        uq      max neval
  m[s, ] 108.75182 112.11678 118.60111 114.58090 120.07952 168.6079   100
 DF[s, ] 100.95019 105.88253 116.04507 110.84693 118.08092 163.9666   100
 DT[s, ]  63.78959  69.07341  80.72039  72.69873  96.51802 136.2016   100

n = 2000
nc = 80000     # 10x bigger (610MB)
microbenchmark(m[s,], DF[s,],DT[s,])
Unit: milliseconds
    expr       min        lq      mean    median        uq      max neval
  m[s, ] 1990.3343 2010.1759 2055.9847 2032.9506 2057.2498 2733.278   100
 DF[s, ] 1083.0373 1212.6633 1265.5346 1234.1558 1300.7502 2105.177   100
 DT[s, ]  698.1295  830.3428  865.5918  862.5773  907.7225 1053.393   100

I have 128MB of L4 cache, though; I guess you have less. The entire 61MB fits in my L4 cache, so I don't really notice the cache inefficiency at that size.

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 70
Model name:            Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
Stepping:              1
CPU MHz:               3345.343
CPU max MHz:           4000.0000
CPU min MHz:           800.0000
BogoMIPS:              5587.63
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
L4 cache:              131072K
NUMA node0 CPU(s):     0-7
