How to calculate normalized ratios in all possible combinations efficiently for a large matrix in R?
Problem description
I want to efficiently calculate normalised ratios for all possible column combinations of a large matrix in R. I asked a similar question earlier with a small dataset, and the solutions provided there worked fine. But when I apply the same solutions to a large dataset (400 x 2151), my system hangs. My machine has an Intel i7 processor and 16 GB of RAM. Here is the code with data:
df <- matrix(rexp(860400), nrow = 400, ncol = 2151)
Solution provided by @Ronak Shah:
cols <- 1:ncol(df)
temp <- expand.grid(cols, cols)
new_data <- (df[,temp[,2]] - df[,temp[,1]])/(df[,temp[,2]] + df[,temp[,1]])
Solution provided by @akrun:
f1 <- function(i, j) (df[, i] - df[, j])/(df[, i] + df[, j])
out <- outer(seq_along(df), seq_along(df), FUN = f1)
colnames(out) <- outer(names(df), names(df), paste, sep = "_")
Both solutions take a very long time and the system hangs. How can I do this efficiently?
Recommended answer
Since memory seems to be your main issue, how about using iterators? Using the package RcppAlgos*, we can make use of permuteIter to calculate your ratios N at a time.
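To see why memory is the bottleneck, here is a rough back-of-the-envelope estimate in base R. It assumes double-precision storage (8 bytes per value) and counts only one of the intermediate matrices that the expand.grid approach builds, such as df[, temp[, 2]]; the variable names are illustrative only.

```r
n_row <- 400
n_col <- 2151

# expand.grid(cols, cols) has one row per ordered pair, including i == j
n_pairs <- n_col^2

# One indexed intermediate such as df[, temp[, 2]] is n_row x n_pairs doubles
bytes <- n_row * n_pairs * 8
gib <- bytes / 1024^3

round(gib, 2)
```

A single intermediate of that shape already takes close to 14 GiB, and the expression needs several of them plus the result, so a 16 GB machine is very likely to start swapping long before the computation finishes.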
If you must have names, you need an additional iterator. This means keeping 2 iterators in sync, which can become tedious. Fortunately, with the summary() method of permuteIter, we can always see what the current index is, and we can reposition the iterators with a wide range of options (e.g. random access with [[, front(), back(), or startOver()).
library(RcppAlgos)
df <- matrix(rexp(860400), nrow = 400, ncol = 2151)
ratioIter <- permuteIter(ncol(df), 2, FUN = function(x) {
    (df[, x[2]] - df[, x[1]]) / (df[, x[2]] + df[, x[1]])
})
## if you really want to name your output, you must have
## an additional name iterator... not very elegant
nameIter <- permuteIter(paste0("col", 1:ncol(df)), 2, FUN = function(x) {
    paste0(rev(x), collapse = "_")
})
firstIter <- matrix(ratioIter$nextIter(), ncol = 1)
firstName <- nameIter$nextIter()
colnames(firstIter) <- firstName
head(firstIter)
col2_col1
[1,] 0.2990054
[2,] -0.9808111
[3,] -0.9041054
[4,] 0.7970873
[5,] 0.8625776
[6,] 0.2768359
## nextNIter returns a list, so we combine it with do.call(cbind, ...)
next5Iter <- do.call(cbind, ratioIter$nextNIter(5))
next5Names <- unlist(nameIter$nextNIter(5))
colnames(next5Iter) <- next5Names
head(next5Iter)
col3_col1 col4_col1 col5_col1 col6_col1 col7_col1
[1,] -0.28099710 0.1665687 0.40565958 -0.7524038 -0.7132844
[2,] -0.81434900 -0.4283759 -0.89811556 -0.8462906 -0.5399741
[3,] -0.02289368 0.4285012 0.05087853 -0.5091659 -0.2328995
[4,] -0.06825458 0.3126928 0.68968843 -0.2180618 0.6651785
[5,] 0.33508319 0.7389108 0.84733425 0.9065263 0.8977107
[6,] 0.61773589 0.3443120 0.61084584 0.5727938 0.3888807
You should note that this does not show results where i == j. For those pairs the numerator is zero, so the ratio is 0 (or NaN wherever the entry itself is 0), and nothing is lost by skipping them. So the total number of results is just under 2151^2 (in fact, it is exactly 2151^2 - 2151):
ratioIter$summary()
$description
[1] "Permutations of 2151 choose 2"
$currentIndex
[1] 6
$totalResults
[1] 4624650
$totalRemaining
[1] 4624644
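The totalResults figure can be checked by hand in base R. The sketch below also estimates how many calls it would take to walk through everything N results at a time; the chunk size of 1000 is only an illustrative assumption, not something the package requires.

```r
n_col <- 2151

# Ordered pairs (i, j) with i != j
n_perm <- n_col * (n_col - 1)

# Same count written as "all pairs minus the diagonal"
n_perm_alt <- n_col^2 - n_col

# Number of chunks needed when consuming chunk_size results per call
chunk_size <- 1000  # hypothetical value
n_chunks <- ceiling(n_perm / chunk_size)

c(n_perm = n_perm, n_chunks = n_chunks)
```

Both expressions give 4624650, which matches the $totalResults reported by summary().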
There are even random access and previous iterators as well:
## Get the last ratio
lastIter <- ratioIter$back()
lastName <- nameIter$back()
mLast <- matrix(lastIter, ncol = 1)
colnames(mLast) <- lastName
head(mLast)
col2150_col2151
[1,] -0.6131926
[2,] 0.9936783
[3,] 0.1373538
[4,] 0.1014347
[5,] -0.5061608
[6,] 0.5773503
## iterate backwards with the previous methods
prev5Iter <- do.call(cbind, ratioIter$prevNIter(5))
prev5Names <- unlist(nameIter$prevNIter(5))
colnames(prev5Iter) <- prev5Names
head(prev5Iter)
col2149_col2151 col2148_col2151 col2147_col2151 col2146_col2151 col2145_col2151
[1,] -0.75500069 -0.72757136 -0.94457988 -0.82858884 -0.25398782
[2,] 0.99696694 0.99674084 0.99778638 0.99826472 0.95738947
[3,] 0.27701596 0.45696010 0.00682574 0.01529448 -0.62368764
[4,] -0.09508689 -0.90698165 -0.38221934 -0.41405984 0.01371556
[5,] -0.31580709 -0.06561386 -0.07435058 -0.08033145 -0.90692881
[6,] 0.82697720 0.86858595 0.81707206 0.75627297 0.46272349
## Get a random sample
set.seed(123)
randomIter <- do.call(cbind, ratioIter[[sample(4624650, 5)]])
## We must reset the seed in order to get the same output for the names
set.seed(123)
randomNames <- unlist(nameIter[[sample(4624650, 5)]])
colnames(randomIter) <- randomNames
head(randomIter)
col1044_col939 col20_col1552 col412_col2014 col1751_col1521 col337_col1295
[1,] -0.3902066 0.4482747 -0.108018200 -0.1662857 -0.3822436
[2,] -0.2358101 0.9266657 -0.657135882 0.0671608 -0.6821823
[3,] -0.7054217 0.8944720 0.092363665 0.2667708 0.1908249
[4,] -0.1574657 0.2775225 -0.221737223 0.3381454 -0.5705021
[5,] -0.4282909 -0.4406433 0.092783086 -0.7506674 -0.1276932
[6,] 0.9998189 -0.2497586 -0.009375891 0.7071864 -0.2425258
Lastly, it is written in C++, so it is very fast:
system.time(ratioIter$nextNIter(1e3))
# user system elapsed
# 0 0 0
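For comparison, the same bounded-memory idea can be sketched in base R without any extra package. This is not the iterator approach shown above but a plain loop that handles one reference column at a time, so only a single nrow x ncol block is held in memory at once. The small matrix size and the reduction to a mean per pair are assumptions made purely for illustration; in practice each block could be written to disk or summarised in whatever way the analysis needs.

```r
set.seed(42)

# Small stand-in for the 400 x 2151 matrix so the sketch runs quickly
df <- matrix(rexp(40 * 25), nrow = 40, ncol = 25)
n <- ncol(df)

# Hypothetical reduction: keep only the mean ratio for each ordered pair (i, j)
mean_ratio <- matrix(NA_real_, nrow = n, ncol = n)

for (j in seq_len(n)) {
  # df[, j] is recycled down every column, so column i of `block` is
  # (df[, i] - df[, j]) / (df[, i] + df[, j]), i.e. the "coli_colj" ratio
  block <- (df - df[, j]) / (df + df[, j])
  mean_ratio[, j] <- colMeans(block)
}

mean_ratio[2, 1]  # mean of the col2_col1 ratio
```

At the full 400 x 2151 size each block would hold 400 * 2151 doubles, roughly 7 MB, so peak memory stays small regardless of how many pairs there are in total.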
* I am the author of RcppAlgos