合并两个R数据帧并标识每一行的源 [英] Merge two R data frames and identify the source of each row
问题描述
在这种情况下,我需要将两个数据框合并在一起,每个数据框都包含一个有关研究主题的观察结果.不幸的是,数据捕获系统允许最终用户在两个屏幕上输入一些变量(例如,性别是在两个时间点捕获的,不应更改).没有数据库方面的检查来确认屏幕之间的数据是否一致,因此我们正在检查后处理.
I am in a circumstance where I need to merge two data frames together that each contain one observation about a research subject. Unfortunately, the data capture system allowed the end-user to enter some variables on two screens (for instance, the gender was captured at two timepoints, and should not change). There are no database-side checks to confirm that the data is consistent between screens, so we are checking in the post-processing.
我想 要做的是使用内置的R merge()
函数通过all=TRUE
选项合并数据帧,以便获得两行共享变量不匹配,然后在结果数据帧中只有一列,该行告诉我行的源(来自合并中的X或Y).据我所知,在merge()
函数中什么都没有,所以我试图为merge()
编写自己的包装程序.
What I would like to do is use the built-in R merge()
function to merge the data frames, with the all=TRUE
option so that I get two rows where the shared variables do not match, and then to have a single column in the resultant data frame that tells me the source of the row (either from X or Y in the merge). As near as I can tell, there's nothing like that in the merge()
function, so I am trying to write my own wrapper for merge()
that will do this.
示例:
example_df1 <- data.frame(subject_id=c(101,102,103,104,105),
gender=c("M","F","M","M","F"),
weight=c(120,130,110,114,144),
score=c(10,12,11,13,11))
example_df2 <- data.frame(subject_id=c(101,102,103,104,105),
gender=c("M","M","M","M","F"),
weight=c(120,130,110,117,144),
site1=c(13,18,23,12,4),
site2=c(3,7,8,11,0),
site3=c(31,28,12,29,40))
merge(x=example_df1,y=example_df2,all=TRUE)
subject_id gender weight score site1 site2 site3
1 101 M 120 10 13 3 31
2 102 F 130 12 NA NA NA
3 102 M 130 NA 18 7 28
4 103 M 110 11 23 8 12
5 104 M 114 13 NA NA NA
6 104 M 117 NA 12 11 29
7 105 F 144 11 4 0 40
所需的输出:
subject_id gender weight score site1 site2 site3 rowsource
1 101 M 120 10 13 3 31 both
2 102 F 130 12 NA NA NA x
3 102 M 130 NA 18 7 28 y
4 103 M 110 11 23 8 12 both
5 104 M 114 13 NA NA NA x
6 104 M 117 NA 12 11 29 y
7 105 F 144 11 4 0 40 both
由于项目周围的监管环境,我需要在没有任何特殊软件包的情况下在base R中实施该解决方案.我最初的想法是尝试使用intersect
查找example_df1
和example_df2
之间的公共变量,然后以某种方式将合并结果的每一行(在这些公共变量内)与example_df1
和example_df2
找出行的来源.这看起来真的很笨拙,因此,我很高兴提出有关如何提高此类任务效率的建议.谢谢!
I need to implement the solution in base R without any special packages if at all possible due to the regulatory environment surrounding the project. My initial thought is to try to use intersect
to find the common variables between both example_df1
and example_df2
, and then to somehow compare each row of the merge result (within those common variables) against both example_df1
and example_df2
to figure out the source of the row. That seems really unwieldy, so I'd appreciate suggestions on how to improve the efficiency of this kind of task. Thanks!
编辑后添加:如果R总是在这种类型的合并中始终将X行放在Y行上方,那么我想这也可以工作,但是我觉得对这种情况更稳定的感觉会更好.
EDITED TO ADD: If R always consistently puts the X row above the Y row in merges of this type, I suppose that could work too, but I think I'd feel better about something more stable than that.
推荐答案
我只会在合并之前添加另一列 ,以使生活更轻松:
I would just add another column before merging to make life easier:
example_df1$source <- "X"
example_df2$source <- "Y"
Merged <- merge(x = example_df1, y = example_df2,
all = TRUE, by = c("subject_id", "gender", "weight"))
Merged$rowSource <- apply(Merged[c("source.x", "source.y")], 1,
function(x) paste(na.omit(x), collapse = ""))
Merged
# subject_id gender weight score source.x site1 site2 site3 source.y rowSource
# 1 101 M 120 10 X 13 3 31 Y XY
# 2 102 F 130 12 X NA NA NA <NA> X
# 3 102 M 130 NA <NA> 18 7 28 Y Y
# 4 103 M 110 11 X 23 8 12 Y XY
# 5 104 M 114 13 X NA NA NA <NA> X
# 6 104 M 117 NA <NA> 12 11 29 Y Y
# 7 105 F 144 11 X 4 0 40 Y XY
从那里开始,如果您希望在输出中将"XY"
更改为"both"
,应该很容易,然后可以删除"source.x"和"source.y"列...
From there, it should be easy to change "XY"
to "both"
if that is what you prefer in your output, and you can then drop the "source.x" and "source.y" columns....
这篇关于合并两个R数据帧并标识每一行的源的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!