R: Comparing fields in matrix
I've got two data frames I want to compare: if a specific location in both data frames meets a requirement, assign "X" to that location in a separate data frame. In the code below, the requirement is that both values are 0, 1, or 2; a location where either value is NA gets "N", every other location gets "O", and the first two columns are always "A".
How can I get the expected output efficiently? The real data frame contains 1000 columns and thousands to millions of rows.
I think data.table would be the quickest option, but I don't yet have a grasp of how data.table works.
Expected output:
> print(result)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "A" "A" "O" "X" "X" "X" "X" "O" "O"
# [2,] "A" "A" "O" "X" "X" "X" "X" "O" "O"
# [3,] "A" "A" "O" "X" "X" "X" "X" "O" "X"
My code:
df1 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 3, 3, 3, 2, 0, 1), .Dim = c(3L, 9L), .Dimnames = list(
c("A", "B", "C"), NULL))
df2 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 1, 3, 3, 4, 4, 2), .Dim = c(3L, 9L), .Dimnames = list(
c("A", "B", "C"), NULL))
result <- matrix("O", nrow(df1), ncol(df1))
for (i in 1:nrow(df1)) {
  # The first two columns are always "A"
  result[i, 1] <- "A"
  result[i, 2] <- "A"
  for (j in 3:ncol(df1)) {
    if (is.na(df1[i, j]) || is.na(df2[i, j])) {
      # Either value is missing
      result[i, j] <- "N"
    } else if (df1[i, j] %in% c("0", "1", "2") && df2[i, j] %in% c("0", "1", "2")) {
      # Both values are 0, 1, or 2
      result[i, j] <- "X"
    }
  }
}
print(result)
Edit
I like both @David's and @Heroka's solutions. On a small dataset, Heroka's solution is about 125 times as fast as the original, and David's is about 29 times as fast. Here's the benchmark:
> mbm
Unit: milliseconds
            expr        min          lq       mean      median          uq        max neval
        original 1058.81826 1110.481659 1131.81711 1112.848211 1124.775989 1428.18079   100
          Heroka    8.46317    8.711986    9.03517    8.914616    9.067793   18.06716   100
DavidAarenburg()   35.58350   36.660565   39.85823   37.061160   38.175700   53.83976   100
Thanks a lot, guys!
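The mbm output above looks like the printed result of the microbenchmark package. A rough sketch of how such a comparison might be set up is given here. The matrix size, the seed, the random test data, and the function names f_ifelse and f_index are assumptions made for illustration, not the asker's actual setup, and the timings will differ from the ones quoted.

```r
set.seed(1)                   # arbitrary seed, for reproducibility only
n_row <- 1000; n_col <- 50    # assumed size; the asker's benchmark size is not stated
a <- matrix(sample(0:4, n_row * n_col, replace = TRUE), n_row, n_col)
b <- matrix(sample(0:4, n_row * n_col, replace = TRUE), n_row, n_col)

# Hypothetical wrapper around the nested ifelse() approach
f_ifelse <- function(a, b) {
  result <- ifelse(is.na(a) | is.na(b), "N",
                   ifelse(a %in% 0:2 & b %in% 0:2, "X", "O"))
  result[, 1:2] <- "A"
  result
}

# Hypothetical wrapper around the logical-indexing approach
f_index <- function(a, b) {
  result <- matrix("O", nrow = nrow(a), ncol = ncol(a))
  result[is.na(a) | is.na(b)] <- "N"
  result[a < 3 & b < 3] <- "X"
  result[, 1:2] <- "A"
  result
}

# Both candidates should agree before their timings are compared
stopifnot(identical(f_ifelse(a, b), f_index(a, b)))

# Run the timing only if the microbenchmark package is installed
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  mbm <- microbenchmark::microbenchmark(
    ifelse = f_ifelse(a, b),
    index  = f_index(a, b),
    times  = 100
  )
  print(mbm)
}
```

Checking that the candidates return identical results first avoids comparing the speed of functions that do different things.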
You have matrices, not dataframes.
One approach might be to use ifelse. Using %in% with a numeric vector instead of a character one saves about 50% of the time, because it avoids the type conversion:
result <- ifelse(is.na(df1) | is.na(df2), "N",
                 ifelse(df1 %in% 0:2 & df2 %in% 0:2, "X", "O"))
result[, 1:2] <- "A"
result
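One detail that may be worth checking in the ifelse version: %in% returns a plain logical vector without dimensions, so the matrix shape of the result comes from the outer test, which ifelse uses as the template for its return value. A small sketch on hypothetical 2 x 2 matrices m1 and m2 (toy values, not taken from the question):

```r
# Hypothetical toy matrices, for illustration only
m1 <- matrix(c(0, 1, 3, NA), nrow = 2)
m2 <- matrix(c(2, 5, 1, 0), nrow = 2)

# %in% drops the dim attribute: this is a plain logical vector of length 4
inner <- m1 %in% 0:2 & m2 %in% 0:2
print(is.null(dim(inner)))

# ifelse() returns a value shaped like its test (a logical matrix here),
# so the matrix layout is preserved even though 'inner' is a plain vector
res <- ifelse(is.na(m1) | is.na(m2), "N", ifelse(inner, "X", "O"))
print(dim(res))
print(res)
```

Because the test also carries the dimnames of df1, the result of the ifelse version on the question's data keeps the row names A, B, C.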
With thanks to @DavidArenburg, a further speed improvement:
result <- matrix("O", nrow = nrow(df1), ncol = ncol(df1))
result[is.na(df1) | is.na(df2)] <- "N"
result[df1 < 3 & df2 < 3] <- "X"
result[, 1:2] <- "A"
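As a sanity check, the loop, ifelse, and logical-indexing approaches can be compared on the sample data. The sketch below wraps each in a function; the names loop_version, ifelse_version, and index_version are hypothetical, and one NA is injected into df1 (it is not in the original data) so that the "N" branch is exercised.

```r
# Sample matrices from the question
df1 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2,
                   2, 2, 2, 2, 3, 3, 3, 2, 0, 1), .Dim = c(3L, 9L),
                 .Dimnames = list(c("A", "B", "C"), NULL))
df2 <- structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1, 1, 2, 2,
                   2, 2, 2, 2, 1, 3, 3, 4, 4, 2), .Dim = c(3L, 9L),
                 .Dimnames = list(c("A", "B", "C"), NULL))

# Assumption: inject one NA (not in the original data) to exercise the "N" branch
df1[2, 5] <- NA

loop_version <- function(a, b) {
  result <- matrix("O", nrow(a), ncol(a))
  result[, 1:2] <- "A"
  for (i in seq_len(nrow(a))) {
    for (j in 3:ncol(a)) {
      if (is.na(a[i, j]) || is.na(b[i, j])) {
        result[i, j] <- "N"
      } else if (a[i, j] %in% 0:2 && b[i, j] %in% 0:2) {
        result[i, j] <- "X"
      }
    }
  }
  result
}

ifelse_version <- function(a, b) {
  result <- ifelse(is.na(a) | is.na(b), "N",
                   ifelse(a %in% 0:2 & b %in% 0:2, "X", "O"))
  result[, 1:2] <- "A"
  result
}

index_version <- function(a, b) {
  result <- matrix("O", nrow = nrow(a), ncol = ncol(a))
  result[is.na(a) | is.na(b)] <- "N"
  # NA entries in a logical index select nothing when assigning a single value,
  # so cells already marked "N" are left alone
  result[a < 3 & b < 3] <- "X"
  result[, 1:2] <- "A"
  result
}

r_loop   <- loop_version(df1, df2)
r_ifelse <- unname(ifelse_version(df1, df2))  # drop the dimnames inherited from df1
r_index  <- index_version(df1, df2)

stopifnot(identical(r_loop, r_ifelse), identical(r_loop, r_index))
print(r_index)
```

Note that df1 < 3 matches %in% 0:2 only when the values are non-negative whole numbers, as they are in the sample data; negative or fractional values below 3 would also be marked "X" by the indexing version.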