How to find differences in elements of 2 data frames based on 2 unique identifiers


Problem description

I have 2 very large data frames similar to the following:

df1 <- data.frame(DS.ID = c(123, 214, 543, 325, 123, 214), OP.ID = c("xxab", "xxac", "xxad", "xxae", "xxaf", "xxaq"), P.ID = c("AAC", "JGK", "DIF", "ADL", "AAC", "JGR"))

> df1
  DS.ID OP.ID P.ID
1   123  xxab  AAC
2   214  xxac  JGK
3   543  xxad  DIF
4   325  xxae  ADL
5   123  xxaf  AAC
6   214  xxaq  JGR

df2 <- data.frame(DS.ID = c(123, 214, 543, 325, 123, 214), OP.ID = c("xxab", "xxac", "xxad", "xxae", "xxaf", "xxaq"), P.ID = c("AAC", "JGK", "DIF", "ADL", "AAC", "JGS"))

> df2
  DS.ID OP.ID P.ID
1   123  xxab  AAC
2   214  xxac  JGK
3   543  xxad  DIF
4   325  xxae  ADL
5   123  xxaf  AAC
6   214  xxaq  JGS

The unique id is the combination of DS.ID and OP.ID: DS.ID can be repeated, but the combination of DS.ID and OP.ID cannot. I want to find the instances where P.ID changes. Also, a given combination of DS.ID and OP.ID will not necessarily be in the same row in both data frames.

In the example above, it would return row 6, because P.ID changed there. I'd want to write both the initial and final values to a data frame.

I have a feeling the initial step would be

rbind.fill(df1,df2)

(.fill because there are added columns in the data frames I'm trying to loop through.)

Edit: Assume there are other columns that have different values as well. Thus, duplicated would not work unless you isolated the relevant columns in their own data frame. But I'll be doing this for many columns and many data frames, so I'd rather not go with that method, for speed's sake.

Recommended answer

If ident is 0 in the output of the following code, the P.ID values for that DS.ID/OP.ID pair differ between the two data frames:

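library(plyr)

# Join on the two-column key; P.ID becomes P.ID.x (from df1) and P.ID.y (from df2)
ll <- merge(df1, df2, by = c("DS.ID", "OP.ID"))

# Within each key pair, match() gives 1 if the P.ID values agree and 0 otherwise
ddply(ll, .(DS.ID, OP.ID), summarize, ident = match(P.ID.x, P.ID.y, nomatch = 0))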
  DS.ID OP.ID ident
1   123  xxab     1
2   123  xxaf     1
3   214  xxac     1
4   214  xxaq     0
5   325  xxae     1
6   543  xxad     1
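
The question also asks for the initial and final values in a data frame. A minimal base R sketch of one way to get them follows; it is not part of the original answer. It assumes each DS.ID/OP.ID pair appears at most once in each data frame, and the names keys, merged, changed and the suffixes .initial and .final are made up for illustration. Only the key columns and the compared column are kept before merging, so other columns that differ do not get in the way.

```r
# Example data from the question; stringsAsFactors = FALSE keeps the IDs as
# plain character vectors regardless of R version
df1 <- data.frame(
  DS.ID = c(123, 214, 543, 325, 123, 214),
  OP.ID = c("xxab", "xxac", "xxad", "xxae", "xxaf", "xxaq"),
  P.ID  = c("AAC", "JGK", "DIF", "ADL", "AAC", "JGR"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  DS.ID = c(123, 214, 543, 325, 123, 214),
  OP.ID = c("xxab", "xxac", "xxad", "xxae", "xxaf", "xxaq"),
  P.ID  = c("AAC", "JGK", "DIF", "ADL", "AAC", "JGS"),
  stringsAsFactors = FALSE
)

# The two columns that together identify a row (illustrative variable name)
keys <- c("DS.ID", "OP.ID")

# Join on the key, keeping only the column being compared; the suffixes
# label the df1 value as "initial" and the df2 value as "final"
merged <- merge(df1[, c(keys, "P.ID")], df2[, c(keys, "P.ID")],
                by = keys, suffixes = c(".initial", ".final"))

# Keep the rows where the value changed; which() drops rows where the
# comparison is NA instead of returning NA-filled rows
changed <- merged[which(merged$P.ID.initial != merged$P.ID.final), ]
print(changed)
```

For the example data, changed holds a single row: DS.ID 214, OP.ID xxaq, with P.ID.initial JGR and P.ID.final JGS. Because merge() matches on the key columns rather than on row position, the result does not depend on the two data frames listing the pairs in the same order.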
