Merging two data frames with different sizes by matching their columns


Problem description

I am trying to "merge" column V of one data frame into another when their X and Y columns match. The match has to work in both directions: dOne.X == dTwo.X & dOne.Y == dTwo.Y, and also dOne.X == dTwo.Y & dOne.Y == dTwo.X. I solved this with a for loop, but it is slow when dOne is big (on my machine it takes 25 minutes when length(dOne.X) == 500000). I would like to know if there is a faster, "vectorized" way to solve this problem. Below is an example of what I want to do:

Data Frame ONE
X Y  V
a b  2
a c  3
a d  0
a e  0
b c  2
b d  3
b e  0
c d  2
c e  0
d e  0

Data Frame TWO
X Y  V
a b  1
a c  1
a d  1
b c  1
b d  1
c d  1
e d  1

Expected Data Frame after the columns are merged
X Y  V V2
a b  2  1
a c  3  1
a d  0  1
a e  0  0
b c  2  1
b d  3  1
b e  0  0
c d  2  1
c e  0  0
d e  0  1

This is the code I am using so far; it is slow when dOne is big (hundreds of thousands of rows):

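copyadjlistValueColumn <- function(dOne, dTwo) {
    # Default for pairs that have no match in dTwo
    dOne$V2 <- 0

    # Assumes X and Y are factors (the data.frame() default before R 4.0.0);
    # levels() returns NULL for character columns
    lv <- union(levels(dOne$Y), levels(dOne$X))

    # Give all four key columns the same levels so they can be compared with ==
    dTwo$X <- factor(dTwo$X, levels = lv)
    dTwo$Y <- factor(dTwo$Y, levels = lv)
    dOne$X <- factor(dOne$X, levels = lv)
    dOne$Y <- factor(dOne$Y, levels = lv)

    # One full scan of dOne per row of dTwo: this is the slow part
    for(i in seq_len(nrow(dTwo))) {
      row <- dTwo[i,]
      dOne$V2[dOne$X == row$X & dOne$Y == row$Y] <- row$V   # same direction
      dOne$V2[dOne$X == row$Y & dOne$Y == row$X] <- row$V   # reversed direction
    }
    dOne
}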

This is a testthat test case that covers what I am expecting (using the data frames above):

test_that("Copy V column to another Data Frame", {
    dfOne <- data.frame(X=c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
                        Y=c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
                        V=c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0))

    dfTwo <- data.frame(X=c("a", "a", "a", "b", "b", "c", "e"),
                        Y=c("b", "c", "d", "c", "d", "d", "d"),
                        V=c(1, 1, 1, 1, 1, 1, 1))

    lv <- union(levels(dfTwo$Y), levels(dfTwo$X))
    dfExpected <- data.frame(X=c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
                             Y=c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
                             V=c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0),
                             V2=c(1, 1, 1, 0, 1, 1, 0, 1, 0, 1))
    dfExpected$X <- factor(dfExpected$X, levels = lv)
    dfExpected$Y <- factor(dfExpected$Y, levels = lv)

    dfMerged <- copyadjlistValueColumn(dfOne, dfTwo)

    expect_identical(dfMerged, dfExpected)
})

Any suggestions?

Thanks a lot :)

Solution

Try two merge calls, where the order of the matching columns is reversed in the second one, to get the 'bidirectional' matching. Each merge adds one column of matched values, with NA where nothing matched. Then you may use e.g. rowSums to collapse the two created columns into one.

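# First pass: match X to X and Y to Y (dfTwo's V becomes V.y)
d1 <- merge(dfOne, dfTwo, by.x = c("X", "Y"), by.y = c("X", "Y"), all.x = TRUE)
# Second pass: match X to Y and Y to X (dfTwo's V is added as V)
d2 <- merge(d1, dfTwo, by.x = c("X", "Y"), by.y = c("Y", "X"), all.x = TRUE)
# merge() sorts its result by the key columns, so binding to dfOne assumes
# dfOne is already in that order (true for the example data)
cbind(dfOne, V2 = rowSums(cbind(d2$V.y, d2$V), na.rm = TRUE))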


#    X Y V V2
# 1  a b 2  1
# 2  a c 3  1
# 3  a d 0  1
# 4  a e 0  0
# 5  b c 2  1
# 6  b d 3  1
# 7  b e 0  0
# 8  c d 2  1
# 9  c e 0  0
# 10 d e 0  1
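
If dOne is not already sorted by X and Y, the rows coming out of merge no longer line up with the original data frame. Here is a rough sketch of the same two-merge idea wrapped in a function that restores the original row order. The function name merge_both_ways and the helper columns row_id, V_fwd and V_rev are made up for this illustration. It assumes each pair appears at most once in dTwo, and it returns X and Y as character columns rather than factors.

```r
# Hypothetical helper (not from the original answer): two left joins via
# merge(), followed by restoring the original row order of dOne.
merge_both_ways <- function(dOne, dTwo) {
  # Compare keys as character so differing factor levels do not matter
  dOne$X <- as.character(dOne$X); dOne$Y <- as.character(dOne$Y)
  dTwo$X <- as.character(dTwo$X); dTwo$Y <- as.character(dTwo$Y)

  dOne$row_id <- seq_len(nrow(dOne))           # remember the original order

  # Forward match: X to X, Y to Y
  names(dTwo)[names(dTwo) == "V"] <- "V_fwd"
  d1 <- merge(dOne, dTwo, by = c("X", "Y"), all.x = TRUE)

  # Reversed match: X to Y, Y to X
  names(dTwo)[names(dTwo) == "V_fwd"] <- "V_rev"
  d2 <- merge(d1, dTwo, by.x = c("X", "Y"), by.y = c("Y", "X"), all.x = TRUE)

  d2 <- d2[order(d2$row_id), ]                 # undo the sorting done by merge()
  d2$V2 <- rowSums(cbind(d2$V_fwd, d2$V_rev), na.rm = TRUE)
  rownames(d2) <- NULL
  d2[, c("X", "Y", "V", "V2")]
}

dfOne <- data.frame(X = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
                    Y = c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
                    V = c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0))
dfTwo <- data.frame(X = c("a", "a", "a", "b", "b", "c", "e"),
                    Y = c("b", "c", "d", "c", "d", "d", "d"),
                    V = c(1, 1, 1, 1, 1, 1, 1))

merge_both_ways(dfOne, dfTwo)$V2          # 1 1 1 0 1 1 0 1 0 1
merge_both_ways(dfOne[10:1, ], dfTwo)$V2  # 1 0 1 0 1 1 0 1 1 1 (reversed input)
```

Note that rowSums adds the two matches together, so if dTwo contained both (a, b) and (b, a), the values would be summed rather than overwritten as in the original loop.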

For faster alternatives to merge, see the data.table and dplyr approaches discussed here: stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right/
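
Another option that stays in base R, and is not taken from the linked question, is to skip the join entirely and look the values up with match on an order-insensitive key. This is only a sketch: the function name lookup_unordered and its fill argument are invented here, it assumes the labels never contain the "|" separator, and when dTwo has duplicate pairs it takes the first match (the original loop would keep the last one).

```r
# Hypothetical helper: vectorized lookup of V from dTwo, treating (x, y)
# and (y, x) as the same pair.
lookup_unordered <- function(dOne, dTwo, fill = 0) {
  # Put the smaller label first so that (e, d) and (d, e) give the same key.
  # Assumes labels do not contain "|".
  pair_key <- function(x, y) {
    x <- as.character(x)
    y <- as.character(y)
    paste(pmin(x, y), pmax(x, y), sep = "|")
  }

  # Position in dTwo of each dOne pair, NA when there is no match
  idx <- match(pair_key(dOne$X, dOne$Y), pair_key(dTwo$X, dTwo$Y))

  dOne$V2 <- ifelse(is.na(idx), fill, dTwo$V[idx])
  dOne
}

dfOne <- data.frame(X = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d"),
                    Y = c("b", "c", "d", "e", "c", "d", "e", "d", "e", "e"),
                    V = c(2, 3, 0, 0, 2, 3, 0, 2, 0, 0))
dfTwo <- data.frame(X = c("a", "a", "a", "b", "b", "c", "e"),
                    Y = c("b", "c", "d", "c", "d", "d", "d"),
                    V = c(1, 1, 1, 1, 1, 1, 1))

lookup_unordered(dfOne, dfTwo)$V2   # 1 1 1 0 1 1 0 1 0 1
```

Because match does a single hashed lookup over all rows, this keeps dOne in its original order and leaves its other columns untouched. Whether it is actually faster than merge or data.table on 500000 rows is something to measure on your own data.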
