R Left Outer Join with 0 Fill Instead of NA While Preserving Valid NA's in Left Table
Problem description
What is the easiest way to do a left outer join on two data tables (dt1, dt2) with the fill value being 0 (or some other value) instead of NA (the default), without overwriting valid NA values in the left data table?
A common answer, such as in this thread, is to do the left outer join with either dplyr::left_join, data.table::merge, or data.table's dt2[dt1] keyed column bracket syntax, followed by a second step simply replacing all NA values by 0 in the joined data table. For example:
library(data.table)
dt1 <- data.table(x = c('a', 'b', 'c', 'd', 'e'), y = c(NA, 'w', NA, 'y', 'z'))
dt2 <- data.table(x = c('a', 'b', 'c'), new_col = c(1, 2, 3))
setkey(dt1, x)
setkey(dt2, x)
merged_tables <- dt2[dt1]
merged_tables[is.na(merged_tables)] <- 0
This approach necessarily assumes that there are no valid NA values in dt1 that need to be preserved. Yet, as you can see in the above example, the results are:
   x new_col y
1: a       1 0
2: b       2 w
3: c       3 0
4: d       0 y
5: e       0 z
but the desired results are:
   x new_col  y
1: a       1 NA
2: b       2  w
3: c       3 NA
4: d       0  y
5: e       0  z
In such a trivial case, instead of using the data.table all-elements replace syntax as above, just the NA values in new_col could be replaced:
library(dplyr)
merged_tables <- mutate(merged_tables, new_col = ifelse(is.na(new_col), 0, new_col))
However, this approach is not practical for very large data sets where dozens or hundreds of new columns are merged, sometimes with dynamically created column names. Even if the column names were all known ahead of time, it's very ugly to list out all the new columns and do a mutate-style replace on each one.
There must be a better way? The issue would be simply resolved if the syntax of any of dplyr::left_join, data.table::merge, or data.table's bracket easily allowed the user to specify a fill value other than NA. Something like:
merged_tables <- data.table::merge(dt1, dt2, by="x", all.x=TRUE, fill=0)
data.table's dcast function allows the user to specify a fill value, so I figure there must be an easier way to do this that I'm just not thinking of. Suggestions?
EDIT: @jangorecki pointed out in the comments that there is a feature request currently open on the data.table GitHub page to do exactly what I just mentioned, updating the nomatch=0 syntax. Should be in the next release of data.table.
Solution
Could you use column indices to refer only to the new columns, as with left_join they'll all be on the right of the resulting data.frame? Here it would be in dplyr:
library(dplyr)
dt1 <- data.frame(x = c('a', 'b', 'c', 'd', 'e'),
                  y = c(NA, 'w', NA, 'y', 'z'),
                  stringsAsFactors = FALSE)
dt2 <- data.frame(x = c('a', 'b', 'c'),
                  new_col = c(1, 2, 3),
                  stringsAsFactors = FALSE)
merged <- left_join(dt1, dt2, by = "x")
# columns contributed by dt2 sit to the right of dt1's columns
index_new_col <- (ncol(dt1) + 1):ncol(merged)
merged[, index_new_col][is.na(merged[, index_new_col])] <- 0
> merged
  x    y new_col
1 a <NA>       1
2 b    w       2
3 c <NA>       3
4 d    y       0
5 e    z       0
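If relying on column positions feels fragile (for instance, dt2[dt1] in data.table does not put the new columns on the right), the same idea can be sketched by name instead: take the non-key column names of dt2 before joining and fill only those afterwards. This is a minimal sketch in base R, not a built-in fill option of any join function. The names key_cols and new_cols are made up for illustration, and it assumes dt1 and dt2 share no non-key column names (otherwise merge renames them with suffixes) and that dt2 has no NA values of its own that should survive.

```r
# Example data from the question, as plain data frames
dt1 <- data.frame(x = c("a", "b", "c", "d", "e"),
                  y = c(NA, "w", NA, "y", "z"),
                  stringsAsFactors = FALSE)
dt2 <- data.frame(x = c("a", "b", "c"),
                  new_col = c(1, 2, 3),
                  stringsAsFactors = FALSE)

# Hypothetical helper names: the join key, and the columns dt2 contributes
key_cols <- "x"
new_cols <- setdiff(names(dt2), key_cols)

# Left outer join; all.x = TRUE keeps rows of dt1 with no match in dt2
merged <- merge(dt1, dt2, by = key_cols, all.x = TRUE)

# Replace NA with 0 in the new columns only; NAs in dt1's columns are untouched
for (col in new_cols) {
  merged[[col]][is.na(merged[[col]])] <- 0
}
```

Because new_cols is computed from names(dt2), this works the same way when dt2 has dozens of columns or dynamically generated column names, and the same loop can be applied to the result of left_join or dt2[dt1].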