Merging two Dataframes in R by ID, One is the subset of the other


Question

I have 2 data frames in R: 'dfold' with 175 variables and 'dfnew' with 75 variables. The two data frames are matched by a primary key, 'pid'. dfnew is a subset of dfold: all variables in dfnew are also in dfold, but with updated, imputed values (no NAs anymore). dfold also has additional variables that I will need in the analysis phase. I would like to merge the two data frames into dfmerge so that the common variables are updated from dfnew --> dfold while the pre-existing variables in dfold are retained. I have tried merge(), match(), and the dplyr and sqldf packages, but I obtain either a dfmerge with only the 75 updated variables (left join) or a dfmerge with 250 variables (old variables with NAs and new variables without them coexist). The only way I found (here) is an elegant but fairly long (10 lines) loop that eliminates the *.x variables after a merge by pid with the all.x = TRUE option. Could you please advise on a more efficient way to obtain this result, if one exists?

Thank you in advance.

P.S.: To make things easier, I have created a minimal version of dfold and dfnew: dfnew now has 3 variables and no NAs, while dfold has 5 variables, NAs included. Here is the structure of the data frames:

dfold:

structure(list(Country = structure(c(1L, 3L, 2L, 3L, 2L), .Label = c("France", 
"Germany", "Spain"), class = "factor"), Age = c(44L, 27L, 30L, 
38L, 40L), Salary = c(72000L, 48000L, 54000L, 61000L, NA), Purchased = structure(c(1L, 
2L, 1L, 1L, 2L), .Label = c("No", "Yes"), class = "factor"), 
    pid = 1:5), .Names = c("Country", "Age", "Salary", "Purchased", 
"pid"), row.names = c(NA, 5L), class = "data.frame")

dfnew:

structure(list(Age = c(44, 27, 30), Salary = c(72000, 48000, 
54000), pid = c(1, 2, 3)), .Names = c("Age", "Salary", "pid"), row.names = c(NA, 
3L), class = "data.frame")

Although the issue here is limited to just 2 variables, please keep in mind that the real scenario involves 75 variables.
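The duplicated columns described in the question can be reproduced with the minimal data. The following is only a sketch, assuming base R `merge()` with its default suffixes; the factor columns are written as character vectors for brevity, which does not affect the join.

```r
# Minimal data from the question (factors replaced by character vectors)
dfold <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany"),
  Age       = c(44L, 27L, 30L, 38L, 40L),
  Salary    = c(72000L, 48000L, 54000L, 61000L, NA),
  Purchased = c("No", "Yes", "No", "No", "Yes"),
  pid       = 1:5
)
dfnew <- data.frame(
  Age    = c(44, 27, 30),
  Salary = c(72000, 48000, 54000),
  pid    = c(1, 2, 3)
)

# A left join keeps every row of dfold, but each shared column appears twice:
# once with suffix .x (from dfold) and once with suffix .y (from dfnew).
dfmerge <- merge(dfold, dfnew, by = "pid", all.x = TRUE)
print(names(dfmerge))
```

With 75 shared variables this doubling is what produces the 250-column result mentioned in the question.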

Recommended answer

Alright, this solution assumes that you don't really need a merge but only want to update the NA values in dfold with the imputed values from dfnew.

> dfold
  Country Age Salary Purchased pid
1  France  NA  72000        No   1
2   Spain  27  48000       Yes   2
3 Germany  30  54000        No   3
4   Spain  38  61000        No   4
5 Germany  40     NA       Yes   5

> dfnew
  Age Salary pid
1  44  72000   1
2  27  48000   2
3  30  54000   3
4  38  61000   4
5  40  70000   5

To do this for a single column, try:

dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[dfnew$pid == dfold$pid], dfold$Salary)

> dfold
  Country Age Salary Purchased pid
1  France  NA  72000        No   1
2   Spain  27  48000       Yes   2
3 Germany  30  54000        No   3
4   Spain  38  61000        No   4
5 Germany  40  70000       Yes   5
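The comparison `dfnew$pid == dfold$pid` works here because both data frames hold the same pids in the same row order. When that cannot be guaranteed, one possible variant is to look up the row positions with `match()`. The sketch below assumes the example data printed in the answer, with dfnew deliberately shuffled; the helper name `idx` is introduced only for illustration.

```r
# Example data as printed in the answer, with the rows of dfnew shuffled
dfold <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany"),
  Age       = c(NA, 27, 30, 38, 40),
  Salary    = c(72000, 48000, 54000, 61000, NA),
  Purchased = c("No", "Yes", "No", "No", "Yes"),
  pid       = 1:5
)
dfnew <- data.frame(
  Age    = c(40, 30, 44, 27, 38),
  Salary = c(70000, 54000, 72000, 48000, 61000),
  pid    = c(5, 3, 1, 2, 4)
)

# For each row of dfold, find the position of the same pid in dfnew
# (NA if that pid does not occur in dfnew)
idx <- match(dfold$pid, dfnew$pid)

# Fill only the missing salaries; existing values are left untouched
dfold$Salary <- ifelse(is.na(dfold$Salary), dfnew$Salary[idx], dfold$Salary)
print(dfold$Salary)
```

A pid that is absent from dfnew yields NA in `idx`, so the corresponding value in dfold simply stays NA.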

Using it on the whole dataset is a bit trickier.

First define all common column names except pid:

cols <- names(dfnew)[names(dfnew) != "pid"]

> cols
[1] "Age"    "Salary"

Now use mapply to replace the NA values with ifelse:

dfold[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[dfnew$pid == dfold$pid], x), dfold[,cols], dfnew[,cols])

> dfold
  Country Age Salary Purchased pid
1  France  44  72000        No   1
2   Spain  27  48000       Yes   2
3 Germany  30  54000        No   3
4   Spain  38  61000        No   4
5 Germany  40  70000       Yes   5
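As an alternative to the mapply call, the same update can be written as a plain loop over the shared columns. This is a sketch rather than the answer's own method: it assumes the example data printed above (with dfnew shuffled to show that row order does not matter), uses `match()` to align rows by pid, and assigns column by column so that each column keeps its own type instead of passing through a matrix.

```r
# Example data as printed in the answer, with the rows of dfnew shuffled
dfold <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany"),
  Age       = c(NA, 27, 30, 38, 40),
  Salary    = c(72000, 48000, 54000, 61000, NA),
  Purchased = c("No", "Yes", "No", "No", "Yes"),
  pid       = 1:5
)
dfnew <- data.frame(
  Age    = c(40, 30, 44, 27, 38),
  Salary = c(70000, 54000, 72000, 48000, 61000),
  pid    = c(5, 3, 1, 2, 4)
)

# Columns present in both data frames, excluding the key
cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")

# Row of dfnew that corresponds to each row of dfold
idx <- match(dfold$pid, dfnew$pid)

for (col in cols) {
  missing <- is.na(dfold[[col]])
  # Overwrite only the NA cells with the value from the matching dfnew row
  dfold[[col]][missing] <- dfnew[[col]][idx[missing]]
}
print(dfold)
```

To overwrite every shared value rather than only the NAs, as the question originally describes, the `missing` mask could be replaced by `!is.na(idx)`, so that all rows with a match in dfnew are updated.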

This assumes that dfnew only includes columns that are present in dfold. If this is not the case, use

cols <- setdiff(intersect(names(dfnew), names(dfold)), "pid")
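A quick way to check the column selection is to run it on name vectors alone. The sketch below assumes a hypothetical extra column `Extra` that exists only in dfnew; it uses the base R set functions `intersect()` and `setdiff()`, which do not depend on the two name vectors having the same length or order.

```r
# Column names of the two data frames; "Extra" is hypothetical and
# stands for any column that exists only in dfnew
old_names <- c("Country", "Age", "Salary", "Purchased", "pid")
new_names <- c("Age", "Salary", "pid", "Extra")

# Keep the names found in both, then drop the key column
cols <- setdiff(intersect(new_names, old_names), "pid")
print(cols)
```

Only `Age` and `Salary` remain, so `Extra` is never looked up in dfold.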
