Merge/combine columns with same name but incomplete data
Question
I have two data frames that have some columns with the same names and others with different names. The data frames look something like this:
df1
  ID hello world hockey soccer
1  1    NA    NA      7      4
2  2    NA    NA      2      5
3  3    10     8      8     23
4  4     4    17      5     12
5  5    NA    NA      3     43
df2
  ID hello world football baseball
1  1     2     3       43        6
2  2     5     1       24       32
3  3    NA    NA        2       23
4  4    NA    NA        5       15
5  5     9     7       12       23
As you can see, in 2 of the shared columns ("hello" and "world"), some of the data is in one of the data frames and the rest is in the other.
What I am trying to do is (1) merge the 2 data frames by "ID", (2) combine all the data from the "hello" and "world" columns in both frames into 1 "hello" column and 1 "world" column, and (3) have the final data frame also contain all of the other columns in the 2 original frames ("hockey", "soccer", "football", "baseball"). So, I want the final result to be this:
  ID hello world hockey soccer football baseball
1  1     2     3      7      4       43        6
2  2     5     1      2      5       24       32
3  3    10     8      8     23        2       23
4  4     4    17      5     12        5       15
5  5     9     7      3     43       12       23
I'm pretty new to R, so the only code I've tried is variations on merge, and I've tried the answer I found here, which was based on a similar question: R: merging copies of the same variable. However, my data sets are actually much bigger than what I'm showing here (there are about 20 matching columns, like "hello" and "world", and 100s of non-matching ones, like "hockey" and "football"), so I'm looking for something that won't require me to write them all out manually.
Any idea if this can be done? I'm sorry I can't provide a sample of my efforts, but I really don't know where to start besides:
mydata <- merge(df1, df2, by=c("ID"), all = TRUE)
To reproduce the data frames:
df1 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(2, 5, NA, NA, 9),
    world = c(3, 1, NA, NA, 7), football = c(43, 24, 2, 5, 12),
    baseball = c(6, 32, 23, 15, 23)), .Names = c("ID", "hello", "world",
    "football", "baseball"), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(NA, NA, 10, 4, NA),
    world = c(NA, NA, 8, 17, NA), hockey = c(7, 2, 8, 5, 3),
    soccer = c(4, 5, 23, 12, 43)), .Names = c("ID", "hello", "world", "hockey",
    "soccer"), class = "data.frame", row.names = c(NA, -5L))
Recommended answer
Here's an approach that involves melting your data, merging the molten data, and using dcast to get it back to a wide form. I've added comments to help understand what is going on.
## Required packages
library(data.table)
library(reshape2)

dcast.data.table(
  merge(
    ## melt the first data.frame and set the key as ID and variable
    setkey(melt(as.data.table(df1), id.vars = "ID"), ID, variable),
    ## melt the second data.frame
    melt(as.data.table(df2), id.vars = "ID"),
    ## you'll have 2 value columns...
    all = TRUE)[, value := ifelse(
      ## ... combine them into 1 with ifelse
      is.na(value.x), value.y, value.x)],
  ## This is your reshaping formula
  ID ~ variable, value.var = "value")
#    ID hello world football baseball hockey soccer
# 1:  1     2     3       43        6      7      4
# 2:  2     5     1       24       32      2      5
# 3:  3    10     8        2       23      8     23
# 4:  4     4    17        5       15      5     12
# 5:  5     9     7       12       23      3     43
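For comparison, the same result can probably be reached in base R without reshaping. The sketch below is not part of the original answer. It assumes that the key column is "ID", that no existing column name already ends in ".x" or ".y", and that whenever both frames hold a non-NA value for the same cell, the value from df1 should win. The data frames are rebuilt with data.frame() so the block runs on its own.

```r
# Data from the question, rebuilt with data.frame() so the sketch is self-contained
df1 <- data.frame(ID = 1:5,
                  hello = c(2, 5, NA, NA, 9),
                  world = c(3, 1, NA, NA, 7),
                  football = c(43, 24, 2, 5, 12),
                  baseball = c(6, 32, 23, 15, 23))
df2 <- data.frame(ID = 1:5,
                  hello = c(NA, NA, 10, 4, NA),
                  world = c(NA, NA, 8, 17, NA),
                  hockey = c(7, 2, 8, 5, 3),
                  soccer = c(4, 5, 23, 12, 43))

# Columns that appear in both frames, excluding the merge key
shared <- setdiff(intersect(names(df1), names(df2)), "ID")

# Full outer merge; each shared column comes back twice, e.g. hello.x and hello.y
merged <- merge(df1, df2, by = "ID", all = TRUE, suffixes = c(".x", ".y"))

for (col in shared) {
  x <- merged[[paste0(col, ".x")]]
  y <- merged[[paste0(col, ".y")]]
  # Keep the df1 value, falling back to the df2 value where df1 is NA
  merged[[col]] <- ifelse(is.na(x), y, x)
  # Drop the two suffixed copies
  merged[[paste0(col, ".x")]] <- NULL
  merged[[paste0(col, ".y")]] <- NULL
}

# Put ID and the combined columns first, followed by everything else
merged <- merged[c("ID", shared, setdiff(names(merged), c("ID", shared)))]
merged
```

Because the shared columns are found with intersect(), the 20 or so matching columns never have to be typed out by hand.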