Merging two Dataframes in R by ID, One is the subset of the other
Question
I have 2 dataframes in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The 2 dataframes are matched by a primary key ('pid'). dfnew is a subset of dfold, so all variables in dfnew are also in dfold, but with updated, imputed values (no NAs anymore). At the same time, dfold has more variables, and I will need them in the analysis phase. I would like to merge the 2 dataframes into dfmerge so as to update the common variables from dfnew --> dfold while retaining the pre-existing variables in dfold. I have tried merge(), match(), and the dplyr and sqldf packages, but I obtain either a dfmerge with only the 75 updated variables (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I found (here) is an elegant but pretty long (10 lines) loop that eliminates the *.x variables after a merge by pid with the all.x = TRUE option. Could you please advise on a more efficient way to obtain this result, if one is available?
Thank you in advance.
P.S.: To make things easier, I have created a minimal version of dfold and dfnew: dfnew now has 3 variables and no NAs, while dfold has 5 variables, NAs included. Here is the structure of the dataframes:
dfold:
structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France",
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L,
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L,
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"),
pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased",
"pid"), row.names = c(NA, 5L), class = "data.frame")
dfnew:
structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000,
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA,
3L), class = "data.frame")
Although the issue here is limited to just 2 variables, please keep in mind that the real scenario involves 75 variables.
Recommended answer
Alright, this solution assumes that you don't really need a merge but only want to update the NA values in your dfold with the imputed values in dfnew.
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 NA Yes 5
> dfnew
Age Salary pid
1 44 72000 1
2 27 48000 2
3 30 54000 3
4 38 61000 4
5 40 70000 5
To do this for a single column, try
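# match() looks up each dfold pid in dfnew, so the two frames need not share row order or length
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[match(dfold$pid, dfnew$pid)], dfold$Salary)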
> dfold
Country Age Salary Purchased pid
1 France NA 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
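When dfnew covers only some of the pids in dfold, as in the question's minimal data, the lookup has to go through the key rather than through row position. The following is a minimal sketch of that idea using match(); the toy values are made up for illustration and are not the poster's real data.

```r
# Illustrative toy data (assumed values): dfnew only covers pid 1-3
dfold <- data.frame(pid = 1:5,
                    Age = c(44L, NA, 30L, 38L, 40L),
                    Salary = c(72000L, 48000L, NA, 61000L, NA))
dfnew <- data.frame(pid = c(1, 2, 3),
                    Age = c(44, 27, 30),
                    Salary = c(72000, 48000, 54000))

# For each row of dfold, the row of dfnew with the same pid (NA if there is none)
idx <- match(dfold$pid, dfnew$pid)

# Fill only the missing salaries; existing values are left untouched
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[idx], dfold$Salary)
print(dfold$Salary)
```

Rows whose pid has no counterpart in dfnew (here pid 5) simply keep their NA.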
Using it on the whole dataset was a bit trickier:
First define all common colnames except pid:
cols <- names(dfnew)[names(dfnew) != "pid"]
> cols
[1] "Age" "Salary"
Now use mapply to replace the NA values with ifelse:
dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[match(dfold$pid, dfnew$pid)], x), dfold[,cols], dfnew[,cols])
> dfold
Country Age Salary Purchased pid
1 France 44 72000 No 1
2 Spain 27 48000 Yes 2
3 Germany 30 54000 No 3
4 Spain 38 61000 No 4
5 Germany 40 70000 Yes 5
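One caveat with mapply is that it simplifies its result to a matrix, so columns of different types would be coerced to a common type. A possible alternative, sketched below, loops over the common columns one at a time and fills only the NA cells. The helper name patch_na and the toy data are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: fill NAs in `old` from `new`, matching rows on `key`
patch_na <- function(old, new, key = "pid") {
  # Columns present in both frames, minus the key
  cols <- setdiff(intersect(names(new), names(old)), key)
  # Row of `new` for each row of `old` (NA where the key is absent from `new`)
  idx <- match(old[[key]], new[[key]])
  for (col in cols) {
    fill <- is.na(old[[col]]) & !is.na(idx)
    old[[col]][fill] <- new[[col]][idx[fill]]
  }
  old
}

# Illustrative toy data (assumed values); dfnew is shorter and in a different order
dfold <- data.frame(Country = c("France", "Spain", "Germany"),
                    Age = c(NA, 27, 30),
                    Salary = c(72000, NA, 54000),
                    pid = 1:3)
dfnew <- data.frame(Age = c(27, 44), Salary = c(48000, 72000), pid = c(2, 1))

dfmerge <- patch_na(dfold, dfnew)
print(dfmerge)
```

Because each column is updated in place, the column order of dfold and every column that is not in dfnew are carried over unchanged.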
This assumes that dfnew only includes columns that are present in dfold. If this is not the case, use
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")