Identify records in data frame A not contained in data frame B

Question

This is my first time posting here, so please be kind ;-)

EDIT: My question was closed before I had a chance to make the changes suggested to me. So I'm trying to do a better job now. Thanks to everyone who has answered so far!

How can I identify records/rows in data frame x.1 that are not contained in data frame x.2 based on all attributes available (i.e. all columns) in the most efficient way?

> x.1 <- data.frame(a=c(1,2,3,4,5), b=c(1,2,3,4,5))
> x.1
  a b
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5

> x.2 <- data.frame(a=c(1,1,2,3,4), b=c(1,1,99,3,4))
> x.2
  a  b
1 1  1
2 1  1
3 2 99
4 3  3
5 4  4

Desired result

  a b
2 2 2
5 5 5

BEST SOLUTION SO FAR

by Prof. Brian Ripley and Gabor Grothendieck

> fun.12 <- function(x.1,x.2,...){
+     x.1p <- do.call("paste", x.1)
+     x.2p <- do.call("paste", x.2)
+     x.1[! x.1p %in% x.2p, ]
+ }
> fun.12(x.1,x.2)
  a b
2 2 2
5 5 5
> sol.12 <- microbenchmark(fun.12(x.1,x.2))
> sol.12 <- median(sol.12$time)/1000000000
> sol.12
[1] 0.000207784
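
One caveat of the paste-based approach is worth illustrating: `paste` joins columns with a single space by default, so two different rows can collapse to the same key when the values themselves contain spaces. The sketch below uses made-up data frames `y.1` and `y.2` (not part of the original question) and an assumed separator `"\r"` that is presumed not to occur in the data; treat it as an illustration rather than a general fix.

```r
# Hypothetical data: character columns whose values contain spaces.
y.1 <- data.frame(a = c("a b", "x"), b = c("c", "y"), stringsAsFactors = FALSE)
y.2 <- data.frame(a = "a", b = "b c", stringsAsFactors = FALSE)

# Default separator (" "): row 1 of y.1 and row 1 of y.2 both become "a b c",
# so row 1 of y.1 is wrongly treated as contained in y.2.
key.1 <- do.call("paste", y.1)
key.2 <- do.call("paste", y.2)
naive <- y.1[!key.1 %in% key.2, ]

# Assumption: "\r" never appears in the data, so the keys stay distinct.
key.1s <- do.call("paste", c(y.1, sep = "\r"))
key.2s <- do.call("paste", c(y.2, sep = "\r"))
safer <- y.1[!key.1s %in% key.2s, ]

nrow(naive)  # only the ("x", "y") row survives
nrow(safer)  # both rows of y.1 survive
```

For purely numeric columns such as those in `x.1` and `x.2`, this collision cannot arise from spaces, which is presumably why the default separator works in the question's example.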

A collection of all solutions tested so far is available on my blog.

Here's the best solution wrapped into a function 'mergeX()':

setGeneric(
    name="mergeX",
    signature=c("src.1", "src.2"),
    def=function(
        src.1,
        src.2,
        ...
    ){
    standardGeneric("mergeX")    
    }
)

setMethod(
    f="mergeX", 
    signature=signature(src.1="data.frame", src.2="data.frame"), 
    definition=function(
        src.1,
        src.2,
        do.inverse=FALSE,
        ...
    ){
    if(!do.inverse){
        out <- merge(x=src.1, y=src.2, ...)
    } else {
        if("by.y" %in% names(list(...))){
            src.2.0 <- src.2
            src.2 <- src.1
            src.1 <- src.2.0
        }
        src.1p <- do.call("paste", src.1)
        src.2p <- do.call("paste", src.2)
        out <- src.1[! src.1p %in% src.2p, ]
    }
    return(out)    
    }
)

Recommended answer

Here are a few ways. #1 and #4 assume that the rows of x.1 are unique. (If rows of x.1 are not unique then they will return only one of the duplicates among the duplicated rows.) The others return all duplicates:

# 1
x.1[!duplicated(rbind(x.2, x.1))[-(1:nrow(x.2))],]

# 2
do.call("rbind", setdiff(split(x.1, rownames(x.1)), split(x.2, rownames(x.2))))

# 3
x.1p <- do.call("paste", x.1)
x.2p <- do.call("paste", x.2)
x.1[! x.1p %in% x.2p, ]

# 4
library(sqldf)
sqldf("select * from `x.1` except select * from `x.2`")
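
The note about duplicates can be checked with a small experiment. The sketch below uses made-up data frames `d.1` and `d.2` (not from the original post) in which `d.1` contains a repeated row, and compares approach #1 with approach #3.

```r
# Hypothetical data: the row (2, 2) appears twice in d.1.
d.1 <- data.frame(a = c(1, 2, 2, 5), b = c(1, 2, 2, 5))
d.2 <- data.frame(a = c(1, 3), b = c(1, 3))

# Approach #1: duplicated() marks the second (2, 2) row as a repeat,
# so only one copy is returned.
res.1 <- d.1[!duplicated(rbind(d.2, d.1))[-(1:nrow(d.2))], ]

# Approach #3: every row of d.1 whose pasted key is absent from d.2 is kept,
# including both copies of (2, 2).
d.1p <- do.call("paste", d.1)
d.2p <- do.call("paste", d.2)
res.3 <- d.1[!d.1p %in% d.2p, ]

nrow(res.1)  # 2 rows: (2, 2) once and (5, 5)
nrow(res.3)  # 3 rows: (2, 2) twice and (5, 5)
```

Which behavior is preferable depends on whether repeated rows in the first data frame carry meaning for the task at hand.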

x.1 and x.2 were swapped and this has been fixed. Also have corrected note on limitations at the beginning.
