R Left Outer Join with 0 Fill Instead of NA While Preserving Valid NA's in Left Table
Problem description
What is the easiest way to do a left outer join on two data tables (dt1, dt2) with the fill value being 0 (or some other value) instead of NA (default) without overwriting valid NA values in the left data table?
A common answer, such as in this thread, is to do the left outer join with either `dplyr::left_join`, `data.table::merge`, or `data.table`'s `dt2[dt1]` keyed column bracket syntax, followed by a second step that simply replaces all `NA` values by `0` in the joined data table. For example:

    library(data.table)
    dt1 <- data.table(x = c('a', 'b', 'c', 'd', 'e'), y = c(NA, 'w', NA, 'y', 'z'))
    dt2 <- data.table(x = c('a', 'b', 'c'), new_col = c(1, 2, 3))
    setkey(dt1, x)
    setkey(dt2, x)
    merged_tables <- dt2[dt1]
    merged_tables[is.na(merged_tables)] <- 0

This approach necessarily assumes that there are no valid `NA` values in `dt1` that need to be preserved. Yet, as you can see in the above example, the results are:

       x new_col y
    1: a       1 0
    2: b       2 w
    3: c       3 0
    4: d       0 y
    5: e       0 z

but the desired results are:

       x new_col  y
    1: a       1 NA
    2: b       2  w
    3: c       3 NA
    4: d       0  y
    5: e       0  z

In such a trivial case, instead of using the `data.table` all-elements replace syntax as above, just the `NA` values in `new_col` could be replaced:

    library(dplyr)
    merged_tables <- mutate(merged_tables, new_col = ifelse(is.na(new_col), 0, new_col))

However, this approach is not practical for very large data sets where dozens or hundreds of new columns are merged, sometimes with dynamically created column names. Even if the column names were all known ahead of time, it's very ugly to list out all the new columns and do a mutate-style replace on each one.
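As an illustration of the many-columns case, here is a minimal sketch that derives the new column names from `dt2` instead of listing them by hand, then fills only those columns. It assumes the join key is `x`, that `dt1` and `dt2` share no other column names (otherwise `merge` would add suffixes), and that no `NA` inside `dt2` itself needs preserving; `other_col` is a hypothetical extra column added only to show more than one new column.

```r
library(data.table)

dt1 <- data.table(x = c("a", "b", "c", "d", "e"),
                  y = c(NA, "w", NA, "y", "z"))
dt2 <- data.table(x = c("a", "b", "c"),
                  new_col = c(1, 2, 3),
                  other_col = c(10, 20, 30))  # hypothetical second new column

# Columns contributed by dt2: everything except the join key.
new_cols <- setdiff(names(dt2), "x")

# Left outer join that keeps every row of dt1.
merged_tables <- merge(dt1, dt2, by = "x", all.x = TRUE)

# Replace NA by reference, but only in the columns that came from dt2,
# so the valid NA values in dt1's column y are left alone.
for (col in new_cols) {
  set(merged_tables,
      i = which(is.na(merged_tables[[col]])),
      j = col,
      value = 0)
}
```

Because `new_cols` is computed from `names(dt2)`, the same loop should work when the column names are created dynamically.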
There must be a better way? The issue would be simply resolved if the syntax of any of `dplyr::left_join`, `data.table::merge`, or `data.table`'s bracket easily allowed the user to specify a `fill` value other than `NA`. Something like:

    merged_tables <- data.table::merge(dt1, dt2, by="x", all.x=TRUE, fill=0)

`data.table`'s `dcast` function allows the user to specify a `fill` value, so I figure there must be an easier way to do this that I'm just not thinking of. Suggestions?
EDIT: @jangorecki pointed out in the comments that there is a feature request currently open on the `data.table` GitHub page to do exactly what I just mentioned, updating the `nomatch=0` syntax. Should be in the next release of `data.table`.

Solution

Could you use column indices to refer only to the new columns, as with `left_join` they'll all be on the right of the resulting data.frame? Here it would be in dplyr:

    library(dplyr)
    dt1 <- data.frame(x = c('a', 'b', 'c', 'd', 'e'),
                      y = c(NA, 'w', NA, 'y', 'z'),
                      stringsAsFactors = FALSE)
    dt2 <- data.frame(x = c('a', 'b', 'c'),
                      new_col = c(1, 2, 3),
                      stringsAsFactors = FALSE)
    merged <- left_join(dt1, dt2)
    # Columns added by the join sit to the right of dt1's columns
    index_new_col <- (ncol(dt1) + 1):ncol(merged)
    merged[, index_new_col][is.na(merged[, index_new_col])] <- 0

    > merged
      x    y new_col
    1 a <NA>       1
    2 b    w       2
    3 c <NA>       3
    4 d    y       0
    5 e    z       0

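A name-based variant of the same idea is sketched below, in case relying on column positions feels fragile. It assumes dplyr 1.0.0 or later (for `across`), a join key named `x`, and no other column names shared between the two tables; `new_cols` is a helper name introduced here, not something from the answer.

```r
library(dplyr)

dt1 <- data.frame(x = c("a", "b", "c", "d", "e"),
                  y = c(NA, "w", NA, "y", "z"),
                  stringsAsFactors = FALSE)
dt2 <- data.frame(x = c("a", "b", "c"),
                  new_col = c(1, 2, 3),
                  stringsAsFactors = FALSE)

# Names of the columns that the join will add (all of dt2 except the key).
new_cols <- setdiff(names(dt2), "x")

merged <- left_join(dt1, dt2, by = "x") %>%
  # Fill NA with 0 only in the joined-in columns; y is not touched.
  mutate(across(all_of(new_cols), ~ coalesce(.x, 0)))
```

Like the index-based version, this also overwrites any `NA` that was already present in `dt2`'s own columns, so it fits only when those do not need to be kept.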