R Left Outer Join with 0 Fill Instead of NA While Preserving Valid NAs in Left Table


Problem description

What is the easiest way to do a left outer join on two data tables (dt1, dt2) with the fill value being 0 (or some other value) instead of NA (default) without overwriting valid NA values in the left data table?

A common answer, such as in this thread, is to do the left outer join with dplyr::left_join, data.table's merge method, or data.table's keyed dt2[dt1] bracket syntax, followed by a second step that simply replaces all NA values with 0 in the joined data table. For example:

library(data.table);
dt1 <- data.table(x=c('a', 'b', 'c', 'd', 'e'), y=c(NA, 'w', NA, 'y', 'z'));
dt2 <- data.table(x=c('a', 'b', 'c'), new_col=c(1,2,3));
setkey(dt1, x);
setkey(dt2, x);
merged_tables <- dt2[dt1];
merged_tables[is.na(merged_tables)] <- 0;

This approach necessarily assumes that there are no valid NA values in dt1 that need to be preserved. Yet, as you can see in the above example, the results are:

   x new_col y
1: a       1 0
2: b       2 w
3: c       3 0
4: d       0 y
5: e       0 z

but the desired results are:

   x new_col y
1: a       1 NA
2: b       2 w
3: c       3 NA
4: d       0 y
5: e       0 z

In such a trivial case, instead of replacing NA in every element of the joined table as above, only the NA values in new_col could be replaced:

library(dplyr);
merged_tables <- mutate(merged_tables, new_col = ifelse(is.na(new_col), 0, new_col));

However, this approach is not practical for very large data sets where dozens or hundreds of new columns are merged, sometimes with dynamically created column names. Even if the column names were all known ahead of time, it's very ugly to list out all the new columns and do a mutate-style replace on each one.
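
For illustration, the per-column replace can be generalized without naming each column by hand. The sketch below assumes dplyr 1.0 or later (for across()), assumes the join key is the single column x, and assumes the right table has no name clashes with the left table; new_cols is a helper name introduced here, not something from the original post.

```r
library(dplyr)

dt1 <- data.frame(x = c("a", "b", "c", "d", "e"),
                  y = c(NA, "w", NA, "y", "z"),
                  stringsAsFactors = FALSE)
dt2 <- data.frame(x = c("a", "b", "c"),
                  new_col = c(1, 2, 3),
                  stringsAsFactors = FALSE)

# Columns contributed by the right table: everything except the join key.
new_cols <- setdiff(names(dt2), "x")

# Join first, then fill NA with 0 only in the columns that came from dt2.
merged_tables <- left_join(dt1, dt2, by = "x") %>%
  mutate(across(all_of(new_cols), ~ coalesce(.x, 0)))
```

Note that this also overwrites any NA that was already present in dt2 for matched rows, so it only fits when the right table itself has no NA values worth keeping.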

There must be a better way? The issue would be resolved if dplyr::left_join, data.table's merge method, or data.table's bracket syntax allowed the user to specify a fill value other than NA. Something like:

# Hypothetical syntax: merge() has no fill argument
merged_tables <- merge(dt1, dt2, by="x", all.x=TRUE, fill=0);

data.table's dcast function allows the user to specify fill value, so I figure there must be an easier way to do this that I'm just not thinking of.

Suggestions?

EDIT: @jangorecki pointed out in the comments that there is a feature request currently open on the data.table GitHub page to do exactly what I just mentioned, updating the nomatch=0 syntax. It should be in the next release of data.table.

Solution

Could you use column indices to refer only to the new columns, since with left_join they'll all be on the right of the resulting data.frame? Here is how it would look in dplyr:

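library(dplyr)

dt1 <- data.frame(x = c('a', 'b', 'c', 'd', 'e'),
                  y = c(NA, 'w', NA, 'y', 'z'),
                  stringsAsFactors = FALSE)
dt2 <- data.frame(x = c('a', 'b', 'c'),
                  new_col = c(1,2,3),
                  stringsAsFactors = FALSE)

# Left join on the shared key; columns from dt2 land to the right of dt1's
merged <- left_join(dt1, dt2, by = "x")
# Positions of the columns that came from dt2
index_new_col <- (ncol(dt1) + 1):ncol(merged)
# Replace NA with 0 in those columns only, leaving dt1's NA values alone
merged[, index_new_col][is.na(merged[, index_new_col])] <- 0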

> merged
  x    y new_col
1 a <NA>       1
2 b    w       2
3 c <NA>       3
4 d    y       0
5 e    z       0
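
The same idea can be sketched in data.table by selecting the new columns by name rather than by position and filling them in place with set(). This is a sketch under assumptions, not part of the original answer: it assumes the join key is the single column x and that the non-key column names of dt2 do not clash with those of dt1 (otherwise merge() would add suffixes and the names would no longer match). new_cols is a helper name introduced here.

```r
library(data.table)

dt1 <- data.table(x = c("a", "b", "c", "d", "e"),
                  y = c(NA, "w", NA, "y", "z"))
dt2 <- data.table(x = c("a", "b", "c"),
                  new_col = c(1, 2, 3))

# Left outer join: keep every row of dt1.
merged <- merge(dt1, dt2, by = "x", all.x = TRUE)

# Columns contributed by dt2: everything except the join key.
new_cols <- setdiff(names(dt2), "x")

# Fill NA with 0 by reference, one new column at a time.
for (col in new_cols) {
  set(merged, i = which(is.na(merged[[col]])), j = col, value = 0)
}
```

Because the loop runs over names computed from dt2, it works the same for one new column or several hundred, and the NA values in y are never touched. As with the positional version, any NA already present in dt2 for matched rows is also replaced.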
