R Left Outer Join with 0 Fill Instead of NA While Preserving Valid NAs in Left Table


Question


What is the easiest way to do a left outer join on two data tables (dt1, dt2) with the fill value being 0 (or some other value) instead of NA (default) without overwriting valid NA values in the left data table?

A common answer, such as in this thread, is to do the left outer join with dplyr::left_join, data.table::merge, or data.table's dt2[dt1] keyed-column bracket syntax, followed by a second step that simply replaces all NA values with 0 in the joined data table. For example:

library(data.table);
dt1 <- data.table(x=c('a', 'b', 'c', 'd', 'e'), y=c(NA, 'w', NA, 'y', 'z'));
dt2 <- data.table(x=c('a', 'b', 'c'), new_col=c(1,2,3));
setkey(dt1, x);
setkey(dt2, x);
merged_tables <- dt2[dt1];
merged_tables[is.na(merged_tables)] <- 0;

This approach necessarily assumes that there are no valid NA values in dt1 that need to be preserved. Yet, as you can see in the above example, the results are:

   x new_col y
1: a       1 0
2: b       2 w
3: c       3 0
4: d       0 y
5: e       0 z

but the desired results are:

   x new_col y
1: a       1 NA
2: b       2 w
3: c       3 NA
4: d       0 y
5: e       0 z

In such a trivial case, instead of replacing every NA in the whole table as above, only the NA values in new_col could be replaced:

library(dplyr);
merged_tables <- mutate(merged_tables, new_col = ifelse(is.na(new_col), 0, new_col));

However, this approach is not practical for very large data sets where dozens or hundreds of new columns are merged, sometimes with dynamically created column names. Even if the column names were all known ahead of time, it's very ugly to list out all the new columns and do a mutate-style replace on each one.

There must be a better way? The issue would be simply resolved if the syntax of any of dplyr::left_join, data.table::merge, or data.table's bracket easily allowed the user to specify a fill value other than NA. Something like:

merged_tables <- data.table::merge(dt1, dt2, by="x", all.x=TRUE, fill=0);

data.table's dcast function allows the user to specify fill value, so I figure there must be an easier way to do this that I'm just not thinking of.

Suggestions?

EDIT: @jangorecki pointed out in the comments that there is a feature request currently open on the data.table GitHub page to do exactly what I just mentioned, updating the nomatch=0 syntax. Should be in the next release of data.table.

Solution

Could you use column indices to refer only to the new columns, as with left_join they'll all be on the right of the resulting data.frame? Here it would be in dplyr:

dt1 <- data.frame(x = c('a', 'b', 'c', 'd', 'e'),
                  y = c(NA, 'w', NA, 'y', 'z'),
                  stringsAsFactors = FALSE)
dt2 <- data.frame(x = c('a', 'b', 'c'),
                  new_col = c(1,2,3),
                  stringsAsFactors = FALSE)

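library(dplyr)

# Join on the shared key column explicitly rather than relying on the default.
merged <- left_join(dt1, dt2, by = "x")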
index_new_col <- (ncol(dt1) + 1):ncol(merged)
merged[, index_new_col][is.na(merged[, index_new_col])] <- 0

> merged
  x    y new_col
1 a <NA>       1
2 b    w       2
3 c <NA>       3
4 d    y       0
5 e    z       0
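
Since the question is framed in terms of data.table, the same idea can also be sketched there by selecting the new columns by name rather than by position. This is only an illustration under some assumptions: the non-key column names of dt1 and dt2 do not collide (otherwise merge adds suffixes and the name comparison below no longer identifies the new columns), and the names `merged` and `new_cols` are made up for this example.

```r
library(data.table)

dt1 <- data.table(x = c("a", "b", "c", "d", "e"),
                  y = c(NA, "w", NA, "y", "z"))
dt2 <- data.table(x = c("a", "b", "c"),
                  new_col = c(1, 2, 3))

# Left outer join: keep every row of dt1.
merged <- merge(dt1, dt2, by = "x", all.x = TRUE)

# Columns that came from dt2 are those the result has but dt1 did not.
# Assumption: no non-key column name is shared between dt1 and dt2.
new_cols <- setdiff(names(merged), names(dt1))

# Fill NA with 0 in the new columns only, by reference; dt1's own
# columns (including the valid NAs in y) are never touched.
for (col in new_cols) {
  set(merged, i = which(is.na(merged[[col]])), j = col, value = 0)
}

print(merged)
```

Note that this fills every NA in the new columns, so a genuine NA stored in dt2 for a matched row would also become 0. If that distinction matters, the unmatched rows would need to be identified separately before filling.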
