R:合并+更新具有第二个表的表的有效方法,其中相同列名的值填充NA [英] R: Efficient Way to Merge+Update Table With Second Table Where Values from Same Column Names Fill NAs

查看:67
本文介绍了R:合并+更新具有第二个表的表的有效方法,其中相同列名的值填充NA的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

摘要:我想通过共享的id键将两个表合并为all=true(完全外部联接),而不是将具有相同名称的列设置为var1.x 等,它们被合并为一列,其中左表中的缺失(NA)值由右表中的值填充(除了合并的标准行为外,即,添加具有不同ID和列的行并使用不同的名称).

Summary: I would like to merge two tables by shared id key as all=true (full outer join) where instead of columns with the same names being set as var1.x var2.y, etc., they are merged as a single column where missing (NA) values in the left table are filled in by values from the right table (in addition to the standard behavior of merge, i.e., appending rows with distinct ids and columns with distinct names).

详细信息:

我想基于共享的id键列将table1table2合并并更新,例如:

I would like to merge + update table1 with table2 based on a shared id key column such that:

1)如果table1table2的列具有相同的名称(id除外),则table1中的值(如果存在)将被保留,并在table2中的值替换. table1中的值是NA.

1) If table1 and table2 have columns with the same name (other than id), the value in table1 is left alone if it exists and replaced by the value in table2 if the value in table1 is NA.

2)如果table2包含table1没有的列(不同的名称),则将它们合并(按ID).

2) If table2 has columns that table1 does not have (different names), they are merged (by id).

3)如果table1idtable2中不匹配,则table2中不同名称列的值为NA

3) If table1 has an id that does not match in table2, values for different name columns from table2 are NA

4)如果table2idtable1中不匹配,则将其添加为新行,并且table1中不同列名称的值为NA.

4) If table2 has an id that does not match in table1, it is added as a new row and the values for the different column names from table1 are NA.

3& 4与标准mergeall=TRUE一样.

3 & 4 are as with standard merge with all=TRUE.

我担心我已经解决了这个问题,因为我找不到使用mergejoin这样做的简单方法,而这并不涉及在每一列上创建ifelse检查.实际数据有〜1000列,因此对每个数据进行ifelse查找将是一个非常长的解决方案.

I'm concerned that I have overthought the problem as I cannot find a straightforward way to do this with a merge or join that doesn't involve creating ifelse checks on every column. Real data has ~1000 columns, so would be incredibly long solution to do ifelse lookups on each one.

可复制的简化示例:

table1  <- data.table(id  =c("id1", "id2", "id3", "id4", "id5", "id6"),
                      var1=c(1,2,3,4,5, 6),
                      var2=c("a", "b", NA, "d", NA, "f"),
                      var3=c(NA, 12, 13, 14, 15, 16));

table2  <- data.table(id  =c("id1", "id2", "id3", "id4", "id5", "id8"),
                      var1=c(1,2,NA,4,5, 8),
                      var2=c(NA, "b", "c", "d", "e", "h"),
                      var4=c("foo", "bar", "oof", "rab", NA, "sna"));

desired <- data.table(id=c("id1", "id2", "id3", "id4", "id5", "id6", "id8"),
                      var1=c(1,2,3,4,5, 6, 8),
                      var2=c("a", "b", "c", "d", "e", "f", "h"),
                      var3=c(NA, 12, 13, 14, 15, 16, NA),
                      var4=c("foo", "bar", "oof", "rab", NA, NA, "sna"));

table1;
    id var1 var2 var3
1: id1    1    a   NA
2: id2    2    b   12
3: id3    3   NA   13
4: id4    4    d   14
5: id5    5    e   15
6: id6    6    f   16

table2;
    id var1 var2 var4
1: id1    1    a  foo
2: id2    2    b  bar
3: id3   NA    c  oof
4: id4    4    d  rab
5: id5    5    e   NA
6: id8    8    h  sna

desired
    id var1 var2 var3 var4
1: id1    1    a   NA  foo
2: id2    2    b   12  bar
3: id3    3    c   13  oof
4: id4    4    d   14  rab
5: id5    5    e   15   NA
6: id6    6    f   16   NA
7: id8    8    h   NA  sna

所需输出的说明:

  1. 对于列var1table1具有所有值,因此将其保留,并且将忽略table2id3NA(请注意,这不包括该行)为以下所述的不同ID进行合并).

  1. For column var1, table1 had all the values, so it is left alone and the NA for id3 in table2 is ignored (note that this doesn't include the row merge for different ids described below).

对于列var2table缺少由id3索引的值,因此它是从table2更新的(请注意,这不包括下面描述的不同ID的行合并) .

For column var2, table was missing the value indexed by id3, so it is updated from table2 (note that this doesn't include the row merge for different ids described below).

对于列var3table2中没有匹配的列,因此将其保持不变.

For column var3, there is no matching column in table2, so it is kept as is.

对于列var4,在table1中没有列var4,因此它是通过id键变量从table2合并的.

For column var4, there was no column var4 in table1, so it is merged from table2 by id key variable.

对于在table1中具有id6的行,在table2中没有匹配的id6,因此仅在table2中的列var4的值在输出id6行.

For row with id6 in table1, there is no matching id6 in table2, so the value for column var4 that is only in table2 is NA in the desired output for row id6.

对于在table2中具有id8的行,在table1中没有匹配的id8,因此仅在table1中的列var3的值在desired中为NA id8行的输出.

For row with id8 in table2 there is no matching id8 in table1, so the value for column var3 that is only in table1 is NA in the desired output for row id8.

使用data.table当然可以有一种简单的方法吗?考虑到实际数据的大小,特别欢迎高效的解决方案. datamerge软件包显然曾经用于精确地做到这一点,但是它已经不在CRAN上了,我无法通过zip在R3.2.3上使用它.是否已为该任务准备了另一个软件包?还有许多其他线程专注于解决一个或几个具有已知名称的列的问题,但是对于大量列,它们似乎并不实用.

Surely there is a straightforward way to do this with data.table? Efficient solutions are particularly welcome given the size of the real data. The datamerge package apparently used to do exactly this, but it isn't on CRAN anymore and I can't get it to work on R3.2.3 from zip. Has another package stepped up for this task? There are many other threads that focus on solving this for a one or a couple of columns with known names, but for large number of columns, they don't seem practical.

推荐答案

这是一种方法:

com.cols    = setdiff(intersect(names(table1), names(table2)), "id")
com.cols.x  = paste0(com.cols, ".x")
com.cols.y  = paste0(com.cols, ".y")

# create combined table
DT = setkey(merge(table1, table2, by="id", all=TRUE), NULL)

# edit common columns where NAs are present
for (j in seq_along(com.cols)) 
  DT[is.na(get(com.cols.x[j])), (com.cols.x[j]) := get(com.cols.y[j])]

# remove unneeded columns
DT[, (com.cols.y) := NULL]

# rename kept columns
setnames(DT, com.cols.x, com.cols)

identical(DT, desired) # TRUE

创建和使用所有这些列名向量非常麻烦.

It's rather messy to create and work with all these column-names vectors.

关于原始问题...

这是另一种方式(如原始文章所述,无需从table2导入新行):

Here's another way (without importing new rows from table2, as in the original post):

com.cols    = setdiff(intersect(names(table1), names(table2)), "id")
i.com.cols  = paste0("i.", com.cols)
new.cols    = c(i.com.cols, setdiff(names(table2), c("id", com.cols)))

# grab columns from table2
table1[table2, (new.cols) := mget(new.cols), on="id"]

# edit common columns where NAs are present
for (j in seq_along(com.cols)) 
  table1[is.na(get(com.cols[j])), (com.cols[j]) := get(i.com.cols[j])]

# remove unneeded columns
table1[, (i.com.cols) := NULL]

这样,所有步骤都是对table1的修改,仅供参考.

This way all steps are modifications of table1 by reference.

这篇关于R:合并+更新具有第二个表的表的有效方法,其中相同列名的值填充NA的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆