Speeding up count of pairwise observations in R

Question

I have a dataset in which a random subset of the measurements for each entry is missing:

dat <- matrix(runif(100), nrow=10)
rownames(dat) <- letters[1:10]
colnames(dat) <- paste("time", 1:10)
dat[sample(100, 25)] <- NA

I am interested in calculating correlations between each pair of rows in this dataset (i.e., a-a, a-b, a-c, a-d, ...). However, I would like to exclude correlations based on fewer than 5 pairwise non-NA observations by setting their value to NA in the resulting correlation matrix.

Currently I am doing this as follows:

cor <- cor(t(dat), use = 'pairwise.complete.obs')
names <- rownames(dat)
filter <- sapply(names, function(x1) sapply(names, function(x2) 
    sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5))
cor[filter] <- NA

However, this operation is very slow, as the actual dataset contains >1,000 entries.

Is there a way to filter cells based on the number of non-NA pairwise observations in a vectorized manner, instead of within nested loops?

Recommended answer

You can count the number of non-NA pairwise observations using a matrix approach.

Let's use the following data generation code. I made the data larger and added more NAs.

nr = 1000;
nc = 900;
dat = matrix(runif(nr*nc), nrow=nr)
rownames(dat) = paste(1:nr)
colnames(dat) = paste("time", 1:nc)
dat[sample(nr*nc, nr*nc*0.9)] = NA

On this data, your filter code takes about 85 seconds:

tic = proc.time()
names = rownames(dat)
filter = sapply(names, function(x1) sapply(names, function(x2) 
    sum(!is.na(dat[x1,]) & !is.na(dat[x2,])) < 5));
toc = proc.time();
show(toc-tic);
# 85.50 seconds

My version creates a matrix with the value 1 for every non-NA entry in the original data and 0 elsewhere. I then use matrix multiplication to count the pairwise non-NAs. It runs in a fraction of a second.

tic = proc.time()
NAmat = matrix(0, nrow = nr, ncol = nc)
NAmat[ !is.na(dat) ] = 1;
filter2 = (tcrossprod(NAmat) < 5)
toc = proc.time();
show(toc-tic);
# 0.09 seconds
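
To see why the matrix product yields the pairwise counts: entry (i, j) of tcrossprod(NAmat) is the sum over columns k of NAmat[i, k] * NAmat[j, k], and each term is 1 only when rows i and j are both observed at column k. A minimal sketch on a small hand-made matrix follows; the matrix `small` and the names `ind` and `counts` are illustrative and not part of the original answer.

```r
# Toy 3 x 4 matrix with a few missing values (illustrative data only)
small <- matrix(c(1,  NA, 3,  4,
                  NA, 2,  3,  4,
                  1,  2,  NA, NA),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("a", "b", "c"), NULL))

# Indicator matrix: 1 where a value is observed, 0 where it is NA
ind <- !is.na(small)
storage.mode(ind) <- "numeric"

# Entry (i, j) counts the columns in which rows i and j are both observed;
# the diagonal holds each row's own number of observed values
counts <- tcrossprod(ind)
print(counts)
```

Here rows a and b share two observed columns (the third and fourth), so counts["a", "b"] is 2, while rows a and c share only the first column.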

A simple check shows that the results are the same:

all(filter == filter2)
# TRUE
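
To connect this back to the original goal, the vectorized filter can then be used to mask the correlation matrix. A minimal sketch using the small example from the question follows; the names `min_obs`, `observed`, `n_pairs` and `cor_mat` are chosen here for illustration, and `set.seed` is added only to make the run reproducible.

```r
set.seed(1)  # for reproducibility only; not in the original question
dat <- matrix(runif(100), nrow = 10)
rownames(dat) <- letters[1:10]
colnames(dat) <- paste("time", 1:10)
dat[sample(100, 25)] <- NA

min_obs <- 5  # minimum number of pairwise non-NA observations (from the question)

# 1 where observed, 0 where NA; dimnames are preserved
observed <- !is.na(dat)
storage.mode(observed) <- "numeric"

# n_pairs[i, j] = number of columns where rows i and j are both non-NA
n_pairs <- tcrossprod(observed)

# Pairwise correlations between rows, then mask the under-supported cells
cor_mat <- cor(t(dat), use = "pairwise.complete.obs")
cor_mat[n_pairs < min_obs] <- NA
```

Naming the result `cor_mat` rather than `cor` avoids shadowing the base function `cor`, which is a stylistic choice and does not change the outcome.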
