R merged loop performance


Problem description

I have a data set with 2000 rows and 4000 columns. I want to compare each row with every other row and measure how similar they are, expressed as the number of differing columns out of the total number of columns.

What I have done so far is as follows:

for (i in 1:nrow(data))
{
    for (j in (i+1):nrow(data))
    { 
        mycount[[i,j]] = length(which(data[i,] != data[j,]))
    }
}

There are two problems with it. First, j doesn't start from i+1 (which is probably a basic mistake). The main problem, however, is the time it consumes: it takes ages.

Could someone please suggest a better way to achieve the same result, the result being the percentage similarity of each row to the other rows?

Here's an example of the data and what I want to achieve:

The output should look like:

mycount[1,2] = 2 (S# and var3 columns are different)
mycount[1,3] = 2 (S# and var1 columns are different)
mycount[1,4] = 2 (S# and var4 columns are different)
mycount[2,3] = ...
mycount[2,4] = ...
mycount[3,4] = 3 (S#, var1 and var4 columns are different)


Recommended answer

One problem in your code is that the value of mycount[[i]] is overwritten on every iteration of the j loop, so what you end up with is mycount[[i]] being equal to length(which(data[i,] != data[nrow(data),])). Another issue is that i+1:nrow(data) does not produce the numbers i+1, i+2, ..., nrow(data) but i + (1:nrow(data)), because the : operator takes precedence over +. So what you want is either (i + 1):nrow(data) or seq(i + 1, nrow(data)).
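The precedence issue is easy to check on small numbers. The snippet below is a minimal sketch with made-up values for n and i (they are not from the question). It also illustrates one edge case worth keeping in mind: when i equals n, (i + 1):n counts downwards instead of being empty, and seq_len() is one way to guard against that.

```r
n <- 5
i <- 2

# ':' binds tighter than '+', so this is evaluated as i + (1:n)
wrong <- i + 1:n          # 3 4 5 6 7

# Parentheses give the intended range i+1, ..., n
right <- (i + 1):n        # 3 4 5

# seq() produces the same values here
also_right <- seq(i + 1, n)

# Edge case: on the last row (i == n) the range runs backwards
i_last <- n
backwards <- (i_last + 1):n               # 6 5

# seq_len(0) is empty, so a loop over this vector would simply be skipped
guarded <- i_last + seq_len(n - i_last)   # length 0
```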

You can try the following code, which should be faster than the double loop (though it is probably still too slow):

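# Split the data into a list with one element per row
rows <- lapply(seq_len(nrow(data)), function(i) data[i, ])
# Count the matching columns for every pair of rows (Vectorize applies the function pairwise)
outer(X = rows, Y = rows, FUN = Vectorize(function(x, y) sum(x == y)))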
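As a possible alternative that is not part of the answer above, one row can be compared against all rows in a single vectorised step, which leaves only one R-level loop. This is a sketch under assumptions: the data can be held in a matrix of a single type, it contains no NA values, and the toy matrix m below uses hypothetical values purely for illustration. It counts differing columns, as in the question, rather than matching ones.

```r
# Toy data: 4 rows, 4 columns (hypothetical values)
m <- rbind(
  c(1, 5, 0, 7),
  c(1, 5, 1, 7),
  c(2, 5, 0, 7),
  c(1, 5, 0, 8)
)

n  <- nrow(m)
tm <- t(m)                 # each column of tm is one row of m
mycount <- matrix(0, n, n)

for (i in seq_len(n)) {
  # Row i is recycled down every column of tm, so each column of the
  # logical matrix marks where some row differs from row i;
  # colSums() then counts the differing columns for each row
  mycount[i, ] <- colSums(tm != m[i, ])
}

# Share of identical columns for each pair of rows
similarity <- 1 - mycount / ncol(m)
```

Here mycount[i, j] is the number of columns in which rows i and j differ, and similarity[i, j] is the fraction of columns in which they agree. The result is a full symmetric matrix, so each pair is computed twice; whether this is fast enough for 2000 x 4000 data would need to be measured.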
