Identify records in data frame A not contained in data frame B


Problem description





This is my first time posting here, so please be kind ;-)

EDIT: My question was closed before I had a chance to make the changes suggested to me, so I'm trying to do a better job now. Thanks to everyone who has answered so far!

QUESTION

How can I identify records/rows in data frame x.1 that are not contained in data frame x.2 based on all attributes available (i.e. all columns) in the most efficient way?

EXAMPLE DATA

> x.1 <- data.frame(a=c(1,2,3,4,5), b=c(1,2,3,4,5))
> x.1
  a b
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5

> x.2 <- data.frame(a=c(1,1,2,3,4), b=c(1,1,99,3,4))
> x.2
  a  b
1 1  1
2 1  1
3 2 99
4 3  3
5 4  4

DESIRED RESULT

  a b
2 2 2
5 5 5

BEST SOLUTION SO FAR

by Prof. Brian Ripley and Gabor Grothendieck

> fun.12 <- function(x.1,x.2,...){
+     x.1p <- do.call("paste", x.1)
+     x.2p <- do.call("paste", x.2)
+     x.1[! x.1p %in% x.2p, ]
+ }
> fun.12(x.1,x.2)
  a b
2 2 2
5 5 5
> library(microbenchmark)
> sol.12 <- microbenchmark(fun.12(x.1,x.2))
> sol.12 <- median(sol.12$time)/1000000000  # median time, nanoseconds -> seconds
> sol.12
[1] 0.000207784

A collection of all solutions tested so far is available at my blog
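
One thing to keep in mind with the paste-based comparison: paste() joins the columns with a single space by default, so two different rows can collapse to the same key if a column value itself contains a space. The sketch below illustrates this with made-up data and a hypothetical helper, rows_not_in(), that is not part of the original post; it passes an explicit separator, which only helps as long as that character does not occur in the data.

```r
# Hypothetical helper (not from the post): rows of x not found in y,
# comparing all columns via a pasted key with an explicit separator.
rows_not_in <- function(x, y, sep = "\r") {
    key <- function(df) do.call("paste", c(unname(as.list(df)), sep = sep))
    x[!key(x) %in% key(y), , drop = FALSE]
}

# Made-up data: row 1 of `a` and the single row of `b` are different rows
a <- data.frame(k = c("1 2", "7"), v = c("3", "8"), stringsAsFactors = FALSE)
b <- data.frame(k = "1", v = "2 3", stringsAsFactors = FALSE)

# Default separator (a space): both rows collapse to the key "1 2 3",
# so row 1 of `a` is wrongly treated as contained in `b`
naive <- a[!do.call("paste", a) %in% do.call("paste", b), ]

# Explicit separator: the keys now differ, so both rows of `a` are kept
safer <- rows_not_in(a, b)

print(nrow(naive))
print(nrow(safer))
```

For purely numeric data frames such as x.1 and x.2 in the example, the default separator is not a problem.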

FINAL EDIT 2011-10-14

Here's the best solution wrapped into a function 'mergeX()':

setGeneric(
    name="mergeX",
    signature=c("src.1", "src.2"),
    def=function(
        src.1,
        src.2,
        ...
    ){
    standardGeneric("mergeX")    
    }
)

setMethod(
    f="mergeX", 
    signature=signature(src.1="data.frame", src.2="data.frame"), 
    definition=function(
        src.1,
        src.2,
        do.inverse=FALSE,
        ...
    ){
    if(!do.inverse){
        out <- merge(x=src.1, y=src.2, ...)
    } else {
        if("by.y" %in% names(list(...))){
            src.2.0 <- src.2
            src.2 <- src.1
            src.1 <- src.2.0
        }
        src.1p <- do.call("paste", src.1)
        src.2p <- do.call("paste", src.2)
        out <- src.1[! src.1p %in% src.2p, ]
    }
    return(out)    
    }
)
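
A minimal usage sketch for mergeX(), using the example data from the question. The generic and method are re-declared here in condensed form only so that the block runs on its own; the logic is meant to be the same as above.

```r
library(methods)

# Condensed re-declaration of the generic and method from the post
setGeneric(
    name = "mergeX",
    signature = c("src.1", "src.2"),
    def = function(src.1, src.2, ...) standardGeneric("mergeX")
)

setMethod(
    f = "mergeX",
    signature = signature(src.1 = "data.frame", src.2 = "data.frame"),
    definition = function(src.1, src.2, do.inverse = FALSE, ...) {
        if (!do.inverse) {
            # Ordinary merge; extra arguments are passed on to merge()
            out <- merge(x = src.1, y = src.2, ...)
        } else {
            if ("by.y" %in% names(list(...))) {
                # Swap the two sources when by.y is supplied
                src.2.0 <- src.2
                src.2 <- src.1
                src.1 <- src.2.0
            }
            # Rows of src.1 whose pasted key does not occur in src.2
            src.1p <- do.call("paste", src.1)
            src.2p <- do.call("paste", src.2)
            out <- src.1[!src.1p %in% src.2p, ]
        }
        return(out)
    }
)

x.1 <- data.frame(a = c(1, 2, 3, 4, 5), b = c(1, 2, 3, 4, 5))
x.2 <- data.frame(a = c(1, 1, 2, 3, 4), b = c(1, 1, 99, 3, 4))

inner <- mergeX(x.1, x.2)                    # rows present in both
anti  <- mergeX(x.1, x.2, do.inverse = TRUE) # rows of x.1 not in x.2
print(anti)
```

With do.inverse = TRUE the call should return rows 2 and 5 of x.1, matching the desired result above.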

Solution

Here are a few ways. #1 and #4 assume that the rows of x.1 are unique. (If the rows of x.1 are not unique, these two return only one copy of each duplicated row.) The others return all duplicates:

# 1
x.1[!duplicated(rbind(x.2, x.1))[-(1:nrow(x.2))],]

# 2
do.call("rbind", setdiff(split(x.1, rownames(x.1)), split(x.2, rownames(x.2))))

# 3
x.1p <- do.call("paste", x.1)
x.2p <- do.call("paste", x.2)
x.1[! x.1p %in% x.2p, ]

# 4
library(sqldf)
sqldf("select * from `x.1` except select * from `x.2`")
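
To make the note about duplicates concrete, here is a small sketch comparing #1 and #3 on made-up data in which x.1 contains the row (2, 2) twice. The data frames below are illustrative only and differ from the example data in the question.

```r
# Made-up data: the row (2, 2) appears twice in x.1 and not at all in x.2
x.1 <- data.frame(a = c(1, 2, 2, 5), b = c(1, 2, 2, 5))
x.2 <- data.frame(a = c(1, 3), b = c(1, 3))

# Approach 1: the second (2, 2) is flagged as a duplicate of the first
# and dropped, so only one copy comes back
res.1 <- x.1[!duplicated(rbind(x.2, x.1))[-(1:nrow(x.2))], ]

# Approach 3: every row of x.1 whose key is missing from x.2 is kept,
# so both copies of (2, 2) come back
x.1p <- do.call("paste", x.1)
x.2p <- do.call("paste", x.2)
res.3 <- x.1[!x.1p %in% x.2p, ]

print(res.1)
print(res.3)
```

Here res.1 has two rows, (2, 2) and (5, 5), while res.3 has three, because it keeps both copies of (2, 2).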

EDIT: x.1 and x.2 were swapped and this has been fixed. Also have corrected note on limitations at the beginning.
