R:合并+更新具有第二个表的表的有效方法,其中相同列名的值填充NA [英] R: Efficient Way to Merge+Update Table With Second Table Where Values from Same Column Names Fill NAs
问题描述
摘要:我想通过共享的id
键将两个表合并为all=true
(完全外部联接),而不是将具有相同名称的列设置为var1.x
Summary: I would like to merge two tables by shared id
key as all=true
(full outer join) where instead of columns with the same names being set as var1.x
var2.y
, etc., they are merged as a single column where missing (NA) values in the left table are filled in by values from the right table (in addition to the standard behavior of merge, i.e., appending rows with distinct ids and columns with distinct names).
详细信息:
我想基于共享的id
键列将table1
与table2
合并并更新,例如:
I would like to merge + update table1
with table2
based on a shared id
key column such that:
1)如果table1
和table2
的列具有相同的名称(id
除外),则table1
中的值(如果存在)将被保留,并在table2
中的值替换. table1
中的值是NA.
1) If table1
and table2
have columns with the same name (other than id
), the value in table1
is left alone if it exists and replaced by the value in table2
if the value in table1
is NA.
2)如果table2包含table1没有的列(不同的名称),则将它们合并(按ID).
2) If table2 has columns that table1 does not have (different names), they are merged (by id).
3)如果table1
的id
在table2
中不匹配,则table2
中不同名称列的值为NA
3) If table1
has an id
that does not match in table2
, values for different name columns from table2
are NA
4)如果table2
的id
在table1
中不匹配,则将其添加为新行,并且table1
中不同列名称的值为NA.
4) If table2
has an id
that does not match in table1
, it is added as a new row and the values for the different column names from table1
are NA.
3& 4与标准merge
和all=TRUE
一样.
3 & 4 are as with standard merge
with all=TRUE
.
我担心我已经解决了这个问题,因为我找不到使用merge
或join
这样做的简单方法,而这并不涉及在每一列上创建ifelse
检查.实际数据有〜1000列,因此对每个数据进行ifelse
查找将是一个非常长的解决方案.
I'm concerned that I have overthought the problem as I cannot find a straightforward way to do this with a merge
or join
that doesn't involve creating ifelse
checks on every column. Real data has ~1000 columns, so would be incredibly long solution to do ifelse
lookups on each one.
可复制的简化示例:
table1 <- data.table(id =c("id1", "id2", "id3", "id4", "id5", "id6"),
var1=c(1,2,3,4,5, 6),
var2=c("a", "b", NA, "d", NA, "f"),
var3=c(NA, 12, 13, 14, 15, 16));
table2 <- data.table(id =c("id1", "id2", "id3", "id4", "id5", "id8"),
var1=c(1,2,NA,4,5, 8),
var2=c(NA, "b", "c", "d", "e", "h"),
var4=c("foo", "bar", "oof", "rab", NA, "sna"));
desired <- data.table(id=c("id1", "id2", "id3", "id4", "id5", "id6", "id8"),
var1=c(1,2,3,4,5, 6, 8),
var2=c("a", "b", "c", "d", "e", "f", "h"),
var3=c(NA, 12, 13, 14, 15, 16, NA),
var4=c("foo", "bar", "oof", "rab", NA, NA, "sna"));
table1;
id var1 var2 var3
1: id1 1 a NA
2: id2 2 b 12
3: id3 3 NA 13
4: id4 4 d 14
5: id5 5 e 15
6: id6 6 f 16
table2;
id var1 var2 var4
1: id1 1 a foo
2: id2 2 b bar
3: id3 NA c oof
4: id4 4 d rab
5: id5 5 e NA
6: id8 8 h sna
desired
id var1 var2 var3 var4
1: id1 1 a NA foo
2: id2 2 b 12 bar
3: id3 3 c 13 oof
4: id4 4 d 14 rab
5: id5 5 e 15 NA
6: id6 6 f 16 NA
7: id8 8 h NA sna
所需输出的说明:
-
对于列
var1
,table1
具有所有值,因此将其保留,并且将忽略table2
中id3
的NA
(请注意,这不包括该行)为以下所述的不同ID进行合并).
For column
var1
,table1
had all the values, so it is left alone and theNA
forid3
intable2
is ignored (note that this doesn't include the row merge for different ids described below).
对于列var2
,table
缺少由id3
索引的值,因此它是从table2
更新的(请注意,这不包括下面描述的不同ID的行合并) .
For column var2
, table
was missing the value indexed by id3
, so it is updated from table2
(note that this doesn't include the row merge for different ids described below).
对于列var3
,table2
中没有匹配的列,因此将其保持不变.
For column var3
, there is no matching column in table2
, so it is kept as is.
对于列var4
,在table1
中没有列var4
,因此它是通过id
键变量从table2
合并的.
For column var4
, there was no column var4
in table1
, so it is merged from table2
by id
key variable.
对于在table1
中具有id6
的行,在table2
中没有匹配的id6
,因此仅在table2
中的列var4
的值在id6
行.
For row with id6
in table1
, there is no matching id6
in table2
, so the value for column var4
that is only in table2
is NA in the desired
output for row id6
.
对于在table2
中具有id8
的行,在table1
中没有匹配的id8
,因此仅在table1
中的列var3
的值在desired
中为NA id8
行的输出.
For row with id8
in table2
there is no matching id8
in table1
, so the value for column var3
that is only in table1
is NA in the desired
output for row id8
.
使用data.table
当然可以有一种简单的方法吗?考虑到实际数据的大小,特别欢迎高效的解决方案. datamerge
软件包显然曾经用于精确地做到这一点,但是它已经不在CRAN上了,我无法通过zip在R3.2.3上使用它.是否已为该任务准备了另一个软件包?还有许多其他线程专注于解决一个或几个具有已知名称的列的问题,但是对于大量列,它们似乎并不实用.
Surely there is a straightforward way to do this with data.table
? Efficient solutions are particularly welcome given the size of the real data. The datamerge
package apparently used to do exactly this, but it isn't on CRAN anymore and I can't get it to work on R3.2.3 from zip. Has another package stepped up for this task? There are many other threads that focus on solving this for a one or a couple of columns with known names, but for large number of columns, they don't seem practical.
推荐答案
这是一种方法:
com.cols = setdiff(intersect(names(table1), names(table2)), "id")
com.cols.x = paste0(com.cols, ".x")
com.cols.y = paste0(com.cols, ".y")
# create combined table
DT = setkey(merge(table1, table2, by="id", all=TRUE), NULL)
# edit common columns where NAs are present
for (j in seq_along(com.cols))
DT[is.na(get(com.cols.x[j])), (com.cols.x[j]) := get(com.cols.y[j])]
# remove unneeded columns
DT[, (com.cols.y) := NULL]
# rename kept columns
setnames(DT, com.cols.x, com.cols)
identical(DT, desired) # TRUE
创建和使用所有这些列名向量非常麻烦.
It's rather messy to create and work with all these column-names vectors.
关于原始问题...
这是另一种方式(如原始文章所述,无需从table2
导入新行):
Here's another way (without importing new rows from table2
, as in the original post):
com.cols = setdiff(intersect(names(table1), names(table2)), "id")
i.com.cols = paste0("i.", com.cols)
new.cols = c(i.com.cols, setdiff(names(table2), c("id", com.cols)))
# grab columns from table2
table1[table2, (new.cols) := mget(new.cols), on="id"]
# edit common columns where NAs are present
for (j in seq_along(com.cols))
table1[is.na(get(com.cols[j])), (com.cols[j]) := get(i.com.cols[j])]
# remove unneeded columns
table1[, (i.com.cols) := NULL]
这样,所有步骤都是对table1
的修改,仅供参考.
This way all steps are modifications of table1
by reference.
这篇关于R:合并+更新具有第二个表的表的有效方法,其中相同列名的值填充NA的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!