R: Comparing fields in matrix


Problem description


I've got two data frames I want to compare: if a specific location in both data frames meets a requirement, assign "X" to that location in a separate data frame.

How can I get the expected output in an efficient way? The real data frame contains 1000 columns and thousands to millions of rows. I think data.table would be the quickest option, but I don't have a grasp of how data.table works yet.

Expected output:

> print(result)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "A"  "A"  "O"  "X"  "X"  "X"  "X"  "O"  "O" 
# [2,] "A"  "A"  "O"  "X"  "X"  "X"  "X"  "O"  "O" 
# [3,] "A"  "A"  "O"  "X"  "X"  "X"  "X"  "O"  "X" 

My code:

df1 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 
            2, 2, 2, 2, 3, 3, 3, 2, 0, 1), .Dim = c(3L, 9L), .Dimnames = list(
              c("A", "B", "C"), NULL))
df2 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2, 
            2, 2, 2, 2, 1, 3, 3, 4, 4, 2), .Dim = c(3L, 9L), .Dimnames = list(
              c("A", "B", "C"), NULL))

result <- matrix("O", nrow(df1), ncol(df1))


for (i in 1:nrow(df1)) 
{
  for (j in 3:ncol(df1)) 
  {
    result[i,1] = c("A")
    result[i,2] = c("A")
    if (is.na(df1[i,j]) || is.na(df2[i,j])){
      result[i,j] <- c("N")
    }
    if (!is.na(df1[i,j]) && !is.na(df2[i,j]))
    {

      if (df1[i,j] %in% c("0","1","2") & df2[i,j] %in% c("0","1","2")) {
        result[i,j] <- c("X") 
      }
    }
  }
}   


print(result)

Edit

I like both @David's and @Heroka's solutions. On a small dataset, Heroka's solution is about 125 times as fast as the original, and David's is about 29 times as fast. Here's the benchmark:

> mbm
Unit: milliseconds
             expr        min          lq       mean      median          uq        max neval
         original 1058.81826 1110.481659 1131.81711 1112.848211 1124.775989 1428.18079   100
           Heroka    8.46317    8.711986    9.03517    8.914616    9.067793   18.06716   100
 DavidAarenburg()   35.58350   36.660565   39.85823   37.061160   38.175700   53.83976   100

Thanks a lot, guys!
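
A comparison like the one in the benchmark could be reproduced roughly as follows. This is only a sketch: the function names (loop_version, ifelse_version, index_version), the matrix sizes, and the random test data are assumptions made for illustration, and it uses base R's system.time() rather than whatever tool produced the table above. It first checks that the three approaches agree, then times each one.

```r
# Hypothetical test data: values 0..4 plus some NAs (sizes chosen arbitrarily).
set.seed(1)
n_row <- 200
n_col <- 20
vals <- c(0:4, NA)
m1 <- matrix(sample(vals, n_row * n_col, replace = TRUE), n_row, n_col)
m2 <- matrix(sample(vals, n_row * n_col, replace = TRUE), n_row, n_col)

# Element-by-element loop, following the logic of the question's code.
loop_version <- function(a, b) {
  res <- matrix("O", nrow(a), ncol(a))
  res[, 1:2] <- "A"
  for (i in seq_len(nrow(a))) {
    for (j in 3:ncol(a)) {
      if (is.na(a[i, j]) || is.na(b[i, j])) {
        res[i, j] <- "N"
      } else if (a[i, j] %in% c("0", "1", "2") &&
                 b[i, j] %in% c("0", "1", "2")) {
        res[i, j] <- "X"
      }
    }
  }
  res
}

# Nested ifelse() over the whole matrix.
ifelse_version <- function(a, b) {
  res <- ifelse(is.na(a) | is.na(b), "N",
                ifelse(a %in% 0:2 & b %in% 0:2, "X", "O"))
  res[, 1:2] <- "A"
  res
}

# Logical-index assignment into a pre-filled matrix.
index_version <- function(a, b) {
  res <- matrix("O", nrow(a), ncol(a))
  res[is.na(a) | is.na(b)] <- "N"
  res[a < 3 & b < 3] <- "X"
  res[, 1:2] <- "A"
  res
}

# All three should give the same matrix on this data.
stopifnot(identical(loop_version(m1, m2), ifelse_version(m1, m2)),
          identical(loop_version(m1, m2), index_version(m1, m2)))

# Rough timings; absolute numbers depend on the machine and matrix size.
print(system.time(loop_version(m1, m2)))
print(system.time(ifelse_version(m1, m2)))
print(system.time(index_version(m1, m2)))
```

For a single run on small data the timings are noisy, so repeating each call many times (as the neval column in the table suggests was done) gives a fairer picture.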

Solution

You have matrices, not data frames.

One approach might be to use ifelse (and to use %in% with a numeric vector, which saves about 50% of the time by avoiding the type conversion):

  result <- ifelse(is.na(df1)|is.na(df2),"N",
                   ifelse(df1 %in% 0:2 & df2 %in% 0:2,"X","O"))
  result[,1:2] <- "A"
  result
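
To see why the nested ifelse still returns a matrix, here is a minimal sketch on two small made-up matrices (m1 and m2 are hypothetical, not the question's data). It shows that %in% drops the dimensions, that the outer ifelse takes its shape from the is.na() test, and that matching against 0:2 selects the same cells as matching against c("0", "1", "2") without converting every number to a string.

```r
# Small hypothetical matrices with one NA each.
m1 <- matrix(c(0, 1, NA, 3, 2, 2), nrow = 2)
m2 <- matrix(c(1, 5, 2, 0, NA, 2), nrow = 2)

# %in% returns a plain logical vector: the matrix dimensions are dropped.
# NA %in% 0:2 is FALSE, not NA.
in_range <- m1 %in% 0:2 & m2 %in% 0:2
print(dim(in_range))  # NULL

# The outer ifelse() takes its shape from its test (a logical matrix),
# so the final result is a matrix again.
res <- ifelse(is.na(m1) | is.na(m2), "N", ifelse(in_range, "X", "O"))
print(res)

# Numeric and character lookup tables select the same cells here;
# the character version has to coerce the numbers to strings first.
same_cells <- identical(m1 %in% 0:2, m1 %in% c("0", "1", "2"))
print(same_cells)
```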

With thanks to @DavidArenburg, a further speed improvement:

result <- matrix("O",nrow=nrow(df1),ncol=ncol(df1))
result[is.na(df1) | is.na(df2)] <- "N"
result[df1 < 3 & df2 < 3] <- "X"
result[, 1:2] <- "A"
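
As a quick check, the sketch below runs the logical-indexing version on the question's data and compares it with the expected output (the expected matrix is typed in by hand from the question). It also illustrates one caveat: df1 < 3 is not the same test as df1 %in% 0:2 for negative or non-integer values, so the shortcut assumes the data only contains non-negative whole numbers; the vector x is a made-up example.

```r
df1 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2,
                   2, 2, 2, 2, 3, 3, 3, 2, 0, 1), .Dim = c(3L, 9L),
                 .Dimnames = list(c("A", "B", "C"), NULL))
df2 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2,
                   2, 2, 2, 2, 1, 3, 3, 4, 4, 2), .Dim = c(3L, 9L),
                 .Dimnames = list(c("A", "B", "C"), NULL))

result <- matrix("O", nrow = nrow(df1), ncol = ncol(df1))
result[is.na(df1) | is.na(df2)] <- "N"  # no NAs in this data, so a no-op here
result[df1 < 3 & df2 < 3] <- "X"        # NA index entries select nothing
result[, 1:2] <- "A"

# Expected output from the question, filled column by column.
expected <- matrix(c(rep("A", 6), rep("O", 3), rep("X", 12),
                     rep("O", 3), "O", "O", "X"), nrow = 3)
print(identical(result, expected))  # TRUE

# Caveat: "< 3" and "%in% 0:2" differ for negative or fractional values.
x <- c(-1, 1.5, 2)
print(x < 3)       # TRUE TRUE TRUE
print(x %in% 0:2)  # FALSE FALSE TRUE
```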
