合并两个R数据帧并标识每一行的源 [英] Merge two R data frames and identify the source of each row

查看:60
本文介绍了合并两个R数据帧并标识每一行的源的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在这种情况下,我需要将两个数据框合并在一起,每个数据框都包含一个有关研究主题的观察结果.不幸的是,数据捕获系统允许最终用户在两个屏幕上输入一些变量(例如,性别是在两个时间点捕获的,不应更改).没有数据库方面的检查来确认屏幕之间的数据是否一致,因此我们正在检查后处理.

I am in a circumstance where I need to merge two data frames together that each contain one observation about a research subject. Unfortunately, the data capture system allowed the end-user to enter some variables on two screens (for instance, the gender was captured at two timepoints, and should not change). There are no database-side checks to confirm that the data is consistent between screens, so we are checking in the post-processing.

我想 要做的是使用内置的R merge()函数通过all=TRUE选项合并数据帧,以便获得两行共享变量不匹配,然后在结果数据帧中只有一列,该行告诉我行的源(来自合并中的X或Y).据我所知,在merge()函数中什么都没有,所以我试图为merge()编写自己的包装程序.

What I would like to do is use the built-in R merge() function to merge the data frames, with the all=TRUE option so that I get two rows where the shared variables do not match, and then to have a single column in the resultant data frame that tells me the source of the row (either from X or Y in the merge). As near as I can tell, there's nothing like that in the merge() function, so I am trying to write my own wrapper for merge() that will do this.

示例:

example_df1 <- data.frame(subject_id=c(101,102,103,104,105),
                          gender=c("M","F","M","M","F"),
                          weight=c(120,130,110,114,144),
                          score=c(10,12,11,13,11))

example_df2 <- data.frame(subject_id=c(101,102,103,104,105),
                          gender=c("M","M","M","M","F"),
                          weight=c(120,130,110,117,144),
                          site1=c(13,18,23,12,4),
                          site2=c(3,7,8,11,0),
                          site3=c(31,28,12,29,40))

merge(x=example_df1,y=example_df2,all=TRUE)

  subject_id gender weight score site1 site2 site3
1        101      M    120    10    13     3    31
2        102      F    130    12    NA    NA    NA
3        102      M    130    NA    18     7    28
4        103      M    110    11    23     8    12
5        104      M    114    13    NA    NA    NA
6        104      M    117    NA    12    11    29
7        105      F    144    11     4     0    40

所需的输出:

  subject_id gender weight score site1 site2 site3 rowsource
1        101      M    120    10    13     3    31   both
2        102      F    130    12    NA    NA    NA    x
3        102      M    130    NA    18     7    28    y
4        103      M    110    11    23     8    12   both
5        104      M    114    13    NA    NA    NA    x
6        104      M    117    NA    12    11    29    y
7        105      F    144    11     4     0    40   both

由于项目周围的监管环境,我需要在没有任何特殊软件包的情况下在base R中实施该解决方案.我最初的想法是尝试使用intersect查找example_df1example_df2之间的公共变量,然后以某种方式将合并结果的每一行(在这些公共变量内)与example_df1example_df2找出行的来源.这看起来真的很笨拙,因此,我很高兴提出有关如何提高此类任务效率的建议.谢谢!

I need to implement the solution in base R without any special packages if at all possible due to the regulatory environment surrounding the project. My initial thought is to try to use intersect to find the common variables between both example_df1 and example_df2, and then to somehow compare each row of the merge result (within those common variables) against both example_df1 and example_df2 to figure out the source of the row. That seems really unwieldy, so I'd appreciate suggestions on how to improve the efficiency of this kind of task. Thanks!

编辑后添加:如果R总是在这种类型的合并中始终将X行放在Y行上方,那么我想这也可以工作,但是我觉得对这种情况更稳定的感觉会更好.

EDITED TO ADD: If R always consistently puts the X row above the Y row in merges of this type, I suppose that could work too, but I think I'd feel better about something more stable than that.

推荐答案

我只会在合并之前添加另一列 ,以使生活更轻松:

I would just add another column before merging to make life easier:

example_df1$source <- "X"
example_df2$source <- "Y"
Merged <- merge(x = example_df1, y = example_df2,
                all = TRUE, by = c("subject_id", "gender", "weight"))
Merged$rowSource <- apply(Merged[c("source.x", "source.y")], 1, 
                          function(x) paste(na.omit(x), collapse = ""))
Merged
#   subject_id gender weight score source.x site1 site2 site3 source.y rowSource
# 1        101      M    120    10        X    13     3    31        Y        XY
# 2        102      F    130    12        X    NA    NA    NA     <NA>         X
# 3        102      M    130    NA     <NA>    18     7    28        Y         Y
# 4        103      M    110    11        X    23     8    12        Y        XY
# 5        104      M    114    13        X    NA    NA    NA     <NA>         X
# 6        104      M    117    NA     <NA>    12    11    29        Y         Y
# 7        105      F    144    11        X     4     0    40        Y        XY

从那里开始,如果您希望在输出中将"XY"更改为"both",应该很容易,然后可以删除"source.x"和"source.y"列...

From there, it should be easy to change "XY" to "both" if that is what you prefer in your output, and you can then drop the "source.x" and "source.y" columns....

这篇关于合并两个R数据帧并标识每一行的源的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆