Why is sampling matrix rows very slow?
Question
I tried to do some bootstrapping and calculate colMeans; naturally I chose a matrix to store the data. However, sampling rows from it is very slow:
m[sample(n,replace=TRUE),]
It turns out that data.table is the fastest:
require(microbenchmark)
require(data.table)
n = 2000
nc = 8000
m = matrix(1:(n*nc) ,nrow = n)
DF = as.data.frame(m)
DT = as.data.table(m)
s=sample(n, replace=TRUE)
microbenchmark(m[s,], DF[s,],DT[s,])
# Unit: milliseconds
# expr min lq mean median uq max neval
# m[s, ] 371.9271 402.3542 421.7907 420.8446 437.8251 506.1788 100
# DF[s, ] 182.3189 199.0865 218.0746 213.9451 231.1518 409.8625 100
# DT[s, ] 129.8225 139.1977 156.9506 150.4321 164.3104 254.2048 100
Why is sampling the matrix so much slower than the other two?
Answer
At first glance, two possibilities stand out in R's function MatrixSubset at line 265. It might be neither of these. Just guessing.
for (i = 0; i < nrs; i++) { // rows
...
for (j = 0; j < ncs; j++) { // columns
...
Your example has a lot of columns (8,000). Each time the inner loop fetches a new column, it needs to fetch the page of RAM holding that value from RAM into cache (most likely L2). The next fetch is a different column and so it's less likely to be able to reuse a page already in L2. A matrix is internally one huge contiguous vector: all of column 1 followed by all of column 2, etc. A page fetch is relatively expensive. Going in the "wrong" direction incurs too many page fetches. More about CPU cache here.
A good compiler should perform loop interchange automatically, such as gcc -floop-interchange which is on by default. More here. This optimization might not be happening in this case due to the complexity of what's inside the for loops; perhaps in this case the switch statements. Or perhaps the version of R you're using on your OS wasn't compiled with a compiler with that option, or it wasn't turned on.
The switch on type is happening on each and every item in the matrix. Even though a matrix is a single type! So this is wasteful. Even if the switch is being optimized with a jump table, that jump table is probably still happening for every item in the matrix ('probably' because the CPU might predict the switch). Since your example matrix is tiny at 61MB, I'm leaning more towards this being the culprit rather than going in the wrong direction.
// Check the row numbers once up front rather than 8,000 times.
// This is a contiguous sweep and therefore almost instant
// Declare variables i and ii locally for safety and maximum compiler optimizations
for (int i = 0; i < nrs; i++) {
int ii = INTEGER(sr)[i];
if (ii != NA_INTEGER && (ii < 1 || ii > nr))
errorcall(call, R_MSG_subs_o_b);
}
// Check the column numbers up front once rather than 2,000 times
for (int j = 0; j < ncs; j++) {
int jj = INTEGER(sc)[j];
if (jj != NA_INTEGER && (jj < 1 || jj > nc))
errorcall(call, R_MSG_subs_o_b);
}
// Now switch once on type rather than 8,000 * 2,000 times
// Loop column-by-column not row-by-row
int resi=0; // contiguous write to result (for page efficiency)
int ii, jj; // the current row and column, bounds checked above
switch (TYPEOF(x)) {
case LGLSXP: // the INTSXP will work for LGLSXP too, currently
case INTSXP:
for (int j=0; j<ncs; j++) { // column-by-column
jj = INTEGER(sc)[j];
for (int i=0; i<nrs; i++) { // within-this-column
ii = INTEGER(sr)[i];
INTEGER(result)[resi++] = (ii == NA_INTEGER || jj == NA_INTEGER) ? NA_INTEGER : INTEGER(x)[ii + jj * nr];
}
}
break;
case REALSXP:
for (int j=0; j<ncs; j++) {
jj = INTEGER(sc)[j];
for (int i=0; i<nrs; i++) {
ii = INTEGER(sr)[i];
REAL(result)[resi++] = (ii == NA_INTEGER || jj == NA_INTEGER) ? NA_REAL : REAL(x)[ii + jj * nr];
}
}
break;
case ...
As you can see, there is more code this way because the same for loops have to be repeated over and over within the switch() cases. Code readability and robustness may be why the original code is the way it is: there is less chance of a typo in R's implementation. That's already demonstrated because I was lazy in not implementing the LGLSXP case specially for LOGICAL. I know LOGICAL is exactly the same as INTEGER currently in base R. But that might change in future, so my laziness (due to code bloat) could well cause a bug in R in future if LOGICAL does change (to say char rather than int for RAM efficiency).
One possible option to solve the code-bloat problem: notice that all that's really happening is moving memory around. So all the types (other than STRSXP, VECSXP and EXPRSXP) can be done with a single double for-loop using memcpy with the type's size. SET_STRING_ELT and SET_VECTOR_ELT still must be used to maintain reference counts on those objects. So that should be just 3 repetitions of the double for loops to maintain. Alternatively, that idiom can be wrapped into a #define as is done in other parts of R.
Finally, whether there are any NAs in the rows or columns passed in (a very common case is to not request the NA'th row or NA'th column!) can be detected in the first bounds-checking loop. If there are no NAs, then the deepest ternary ((ii == NA_INTEGER || jj == NA_INTEGER) ? :), which is 2000 * 8000 calls to that branch, can be saved by hoisting that branch outside. But at the cost of more complex repeated code. However, perhaps branch prediction kicks in reliably on all architectures and we shouldn't worry about that.
data.table does both the memcpy trick and the deep-branch saving in some but not all places. It has also started to subset in parallel, column by column. But not in this case yet, just because it's new and still being rolled out (setkey is very similar and is already parallel). The master thread handles the character and list columns one by one (not in parallel) since SET_STRING_ELT and SET_VECTOR_ELT are not thread-safe in R. The other threads handle all the integer, real, complex and raw columns in parallel. It then goes as fast as memory I/O can go.
I don't really see the difference that you see at 61MB, but scaling up to a (still small) 610MB by increasing the number of columns 10x to 80,000, I do see a difference.
n = 2000
nc = 8000 # same size as your example (61MB), on my laptop
microbenchmark(m[s,], DF[s,],DT[s,])
Unit: milliseconds
expr min lq mean median uq max neval
m[s, ] 108.75182 112.11678 118.60111 114.58090 120.07952 168.6079 100
DF[s, ] 100.95019 105.88253 116.04507 110.84693 118.08092 163.9666 100
DT[s, ] 63.78959 69.07341 80.72039 72.69873 96.51802 136.2016 100
n = 2000
nc = 80000 # 10x bigger (610MB)
microbenchmark(m[s,], DF[s,],DT[s,])
Unit: milliseconds
expr min lq mean median uq max neval
m[s, ] 1990.3343 2010.1759 2055.9847 2032.9506 2057.2498 2733.278 100
DF[s, ] 1083.0373 1212.6633 1265.5346 1234.1558 1300.7502 2105.177 100
DT[s, ] 698.1295 830.3428 865.5918 862.5773 907.7225 1053.393 100
I have 128MB of L4 cache, though. I guess you have less cache. The entire 61MB fits in my L4 cache, so I don't really notice the cache inefficiency at that size.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 70
Model name: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
Stepping: 1
CPU MHz: 3345.343
CPU max MHz: 4000.0000
CPU min MHz: 800.0000
BogoMIPS: 5587.63
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
L4 cache: 131072K
NUMA node0 CPU(s): 0-7