用另一个数据表中一列的值更新一个数据表中的一列 NA [英] Update a column of NAs in one data table with the value from a column in another data table
问题描述
我在这个主题上看到过类似的帖子,但无法理解解决方案.本质上,我有一个包含分数和一些 NA 的数据表 (DT1).无论哪里有NA",我都希望有一个过程引用另一个数据 (DT2) 表,该表在两个表 (tract) 中都有一个公共列,然后用 DT2 的分数列的值更新 DT1 中的 NA.我选择数据表类是因为我认为这是一个有效的选择.
I have seen a similar post on this topic, but could not comprehend the solution. Essentially I have a data table (DT1) with scores and some NAs. Wherever there is an 'NA' I would like a process that references another data (DT2) table that has a common column in both tables (tract) and then update the NA in DT1 with the value of DT2's score column. I am choosing a data table class because I think it is a an efficient option.
DT1
tract CreditScore
1: 36107020401 635
2: 36083052403 NA
3: 36091062602 NA
4: 36067013000 NA
5: 36083052304 NA
DT2
tract CreditScore
1: 36107020401 635
2: 36083052403 650
3: 36091062602 335
4: 36067013000 777
5: 36083052304 663
推荐答案
我们创建了新的(更全面的)HTML 小插图 以了解一些 data.table 概念.查看此处,了解我们正在处理的其他小插曲.我正在制作用于加入的小插图,完成后有望更好地阐明这些类型的问题.
We've created new (and more comprehensive) HTML vignettes for some of the data.table concepts. Have a look here for the other vignettes that we are working on. I'm working on vignettes for joins, which when done will hopefully clarify these type of problems better.
这个想法是在 tract
列上的 DT1
上首先 setkey()
.
The idea is to first setkey()
on DT1
on the column tract
.
setkey(DT1, tract)
在 data.tables 中,x[i]
形式的连接需要 x
的键,但不需要 i
.这会导致两种情况:
In data.tables, a join of the form x[i]
requires key for x
, but not necessarily for i
. This results in two scenarios:
如果
i
也有key set——i
的第一个key列与x
的第一个key列匹配,第二对第二等等..
If
i
also has key set -- the first key column ofi
is matched against first key column ofx
, second against second and so on..
如果 i
没有设置键 -- i
的第一列与第一列 key 匹配x
,i
的第二列对 x
的第二个 key 列,依此类推..
If i
doesn't have key set -- the first column of i
is matched against the first key column of x
, second column of i
against second key column of x
and so on..
在这种情况下,由于您在 i
中的第一列也是 tract
,我们将跳过 i
上的设置键.
In this case, since your first column in i
is also tract
, we'll skip setting key on i
.
然后,我们执行 x[i]
形式的连接.通过这样做,对于每个 i
,计算 x
中匹配的行索引,然后实现连接结果.但是,我们不希望整个连接结果作为新的 data.table.相反,我们想用那些匹配行上的 DT2
更新 DT1
的 CreditScore
列..
Then, we perform a join of the form x[i]
. By doing this, for each i
the matching row indices in x
is computed, and then the join result is materialised. However, we don't want the entire join result as a new data.table. Rather, we want to update DT1
's CreditScore
column with DT2
's on those matching rows..
在data.tables中,我们可以在加入时执行该操作,通过在j
中提供表达式,如下:
In data.tables, we can perform that operation while joining, by providing the expression in j
, as follows:
DT1[DT2, CreditScore := i.CreditScore]
# tract CreditScore
# 1: 36067013000 777
# 2: 36083052304 663
# 3: 36083052403 650
# 4: 36091062602 335
# 5: 36107020401 635
DT1[DT2
部分为 DT2
中的每一行查找 DT1
中匹配的行.如果匹配,我们希望 DT2
的值在 DT1
中更新.我们通过使用 i.CreditScore
来实现——它引用了 DT2
的 CreditScore
列(i.
是用于区分 x
和 i
data.tables 之间具有相同名称的列的前缀.
DT1[DT2
part finds the matching rows in DT1
for each row in DT2
. And if there's a match, we want DT2
's value to be updated in DT1
. We accomplish that by using i.CreditScore
-- it refers to DT2
's CreditScore
column (i.
is a prefix used to distinguish columns with identical names between x
and i
data.tables).
更新:正如评论中所指出的,上述解决方案还将更新 DT1
中的非 NA 值.因此,这样做的方法是:
Update: As pointed out under comments, the solution above would also update the non-NA values in DT1
. Therefore the way to do it would be:
DT1[is.na(CreditScore), CreditScore := DT2[.(.SD), CreditScore]]
在 DT1
中的 CreditScore
为 NA
的那些行上,替换 DT1
中的 CreditScore
code> 和 CreditScore
的值从 DT2[.(.SD)]
的连接中获得,其中 .SD
对应于data.table 包含 CreditScore
为 NA
的所有行.
On those rows where CreditScore
from DT1
is NA
, replace CreditScore
from DT1
with the values from CreditScore
obtained from the join of DT2[.(.SD)]
, where .SD
corresponds to the subset of data.table that contains all the rows where CreditScore
is NA
.
HTH
这篇关于用另一个数据表中一列的值更新一个数据表中的一列 NA的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!