Merge/combine columns with same name but incomplete data
Question
I have two data frames that have some columns with the same names and others with different names. The data frames look something like this:
df1
  ID hello world hockey soccer
1  1    NA    NA      7      4
2  2    NA    NA      2      5
3  3    10     8      8     23
4  4     4    17      5     12
5  5    NA    NA      3     43
df2
  ID hello world football baseball
1  1     2     3       43        6
2  2     5     1       24       32
3  3    NA    NA        2       23
4  4    NA    NA        5       15
5  5     9     7       12       23
As you can see, in 2 of the shared columns ("hello" and "world"), some of the data is in one of the data frames and the rest is in the other.
What I am trying to do is (1) merge the 2 data frames by "id", (2) combine all the data from the "hello" and "world" columns in both frames into 1 "hello" column and 1 "world" column, and (3) have the final data frame also contain all of the other columns in the 2 original frames ("hockey", "soccer", "football", "baseball"). So, I want the final result to be this:
  ID hello world hockey soccer football baseball
1  1     2     3      7      4       43        6
2  2     5     1      2      5       24       32
3  3    10     8      8     23        2       23
4  4     4    17      5     12        5       15
5  5     9     7      3     43       12       23
I'm pretty new at R, so the only code I've tried is variations on merge, and I've tried the answer I found here, which was based on a similar question: R: merging copies of the same variable. However, my data sets are actually much bigger than what I'm showing here (there are about 20 matching columns (like "hello" and "world") and 100s of non-matching ones (like "hockey" and "football")), so I'm looking for something that won't require me to write them all out manually.
Any idea if this can be done? I'm sorry I can't provide a sample of my efforts, but I really don't know where to start besides:
mydata <- merge(df1, df2, by=c("ID"), all = TRUE)
To reproduce the data frames:
df1 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(2, 5, NA, NA, 9),
world = c(3, 1, NA, NA, 7), football = c(43, 24, 2, 5, 12),
baseball = c(6, 32, 23, 15, 23)), .Names = c("ID", "hello", "world",
"football", "baseball"), class = "data.frame", row.names = c(NA, -5L))
df2 <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L), hello = c(NA, NA, 10, 4, NA),
world = c(NA, NA, 8, 17, NA), hockey = c(7, 2, 8, 5, 3),
soccer = c(4, 5, 23, 12, 43)), .Names = c("ID", "hello", "world", "hockey",
"soccer"), class = "data.frame", row.names = c(NA, -5L))
Recommended answer
Here's an approach that involves melting your data, merging the molten data, and using dcast to get it back to a wide form. I've added comments to help understand what is going on.
## Required packages
library(data.table)
library(reshape2)
dcast.data.table(
  merge(
    ## melt the first data.frame and set the key as ID and variable
    setkey(melt(as.data.table(df1), id.vars = "ID"), ID, variable),
    ## melt the second data.frame
    melt(as.data.table(df2), id.vars = "ID"),
    ## you'll have 2 value columns...
    all = TRUE)[, value := ifelse(
      ## ... combine them into 1 with ifelse
      is.na(value.x), value.y, value.x)],
  ## This is your reshaping formula
  ID ~ variable, value.var = "value")
#    ID hello world football baseball hockey soccer
# 1:  1     2     3       43        6      7      4
# 2:  2     5     1       24       32      2      5
# 3:  3    10     8        2       23      8     23
# 4:  4     4    17        5       15      5     12
# 5:  5     9     7       12       23      3     43
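For comparison, here is a minimal base R sketch of the same idea that skips the reshaping step. It is not the method used in the answer above, just one possible alternative. It assumes the key column is called ID, that merge() is left with its default ".x"/".y" suffixes, and that none of the original column names already end in those suffixes. The names shared, merged, and result are illustrative only.

```r
# Same example data as in the question's dput() output
df1 <- data.frame(
  ID = 1:5,
  hello = c(2, 5, NA, NA, 9),
  world = c(3, 1, NA, NA, 7),
  football = c(43, 24, 2, 5, 12),
  baseball = c(6, 32, 23, 15, 23)
)
df2 <- data.frame(
  ID = 1:5,
  hello = c(NA, NA, 10, 4, NA),
  world = c(NA, NA, 8, 17, NA),
  hockey = c(7, 2, 8, 5, 3),
  soccer = c(4, 5, 23, 12, 43)
)

# Columns present in both frames, excluding the key (assumed to be "ID")
shared <- setdiff(intersect(names(df1), names(df2)), "ID")

# merge() keeps both copies of each shared column as <name>.x and <name>.y
merged <- merge(df1, df2, by = "ID", all = TRUE)

# For each shared column, take the df1 value and fall back to df2 where it is NA
for (col in shared) {
  x <- merged[[paste0(col, ".x")]]
  y <- merged[[paste0(col, ".y")]]
  merged[[col]] <- ifelse(is.na(x), y, x)
}

# Drop the suffixed helper columns now that they have been combined
helpers <- c(paste0(shared, ".x"), paste0(shared, ".y"))
result <- merged[, !names(merged) %in% helpers]
```

Because shared is computed with intersect(), the 20 or so matching columns never have to be typed out by hand. The combined columns end up at the right-hand side of result, so reorder them afterwards if the column order matters.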