使用另一个数据表中列的值更新一个数据表中的一列NA [英] Update a column of NAs in one data table with the value from a column in another data table
问题描述
我在这个主题上看到过类似的帖子,但无法理解解决方案。基本上我有一个数据表(DT1)与分数和一些NA。无论哪里有'NA',我想要一个进程引用另一个数据(DT2)表,在两个表(道)中有一个共同的列,然后用DT2的分数列的值更新DT1中的NA。我选择了一个数据表类,因为我认为这是一个高效的选项。
DT1
tract CreditScore
1:36107020401 635
2:36083052403 NA
3:36091062602 NA
4:36067013000 NA
5:36083052304 NA
DT2
tract CreditScore
1:36107020401 635
2:36083052403 650
3:36091062602 335
4:36067013000 777
5:36083052304 663
我们创建了新的://github.com/Rdatatable/data.table/wiki/Getting-startedrel =nofollow> HTML vignettes 中的一些data.table概念。请查看我们正在处理的其他短信的此处。我正在为连接工作的小插曲,当完成时,希望能更好地澄清这些类型的问题。
在列 tract
上的上的
setkey()
setkey(DT1,tract)
在data.tables中, x [i]
的连接需要 x
,但不一定适用于 i
。这导致两种情况:
-
如果
i
-i
的第一个键列与x
的第一个键列匹配,第二个与第二个键匹配,依此类推。 -
如果
的第二个键i
没有键集 - <$ c $将x
的第一列i与
的第一个键 c> c>i
在这种情况下,由于 i
中的第一列也是 tract
,我们将跳过 i
上的设置键。
形式 x [i]
。通过这样做,对于每个 i
,计算 x
中匹配的行索引,然后实现连接结果。但是,我们不希望将整个连接结果作为新的数据表。相反,我们要使用 DT2 $ c $更新
DT1
的 CreditScore
在data.tables中,我们可以在连接时通过提供 j中的表达式来执行该操作
,如下所示:
DT1 [DT2,CreditScore:= i.CreditScore]
#tract CreditScore
#1:36067013000 777
#2:36083052304 663
#3:36083052403 650
#4:36091062602 335
#5:36107020401 635
DT1 [DT2
code> DT2 中的每一行,code> DT1 。如果有匹配,我们要在DT1
中更新DT2
的值。我们通过使用i.CreditScore
- 它指DT2
的CreditScore
列(i。
是用于区分x
code> i data.tables)。
更新: / strong>正如在注释中指出的,上面的解决方案还会更新
DT1
中的非NA值。因此,这样做的方式是:DT1 [is.na(CreditScore),CreditScore:= DT2 [。 SD),CreditScore]]
在
CreditScore
从DT1
为NA
,将CreditScore
与
DT2 [。(。SD)]的连接获得的
,其中CreditScore
].SD
对应于data.table的子集,其中包含CreditScore
是NA
。
HTH
I have seen a similar post on this topic, but could not comprehend the solution. Essentially I have a data table (DT1) with scores and some NAs. Wherever there is an 'NA' I would like a process that references another data (DT2) table that has a common column in both tables (tract) and then update the NA in DT1 with the value of DT2's score column. I am choosing a data table class because I think it is a an efficient option.
DT1
tract CreditScore 1: 36107020401 635 2: 36083052403 NA 3: 36091062602 NA 4: 36067013000 NA 5: 36083052304 NA
DT2
tract CreditScore 1: 36107020401 635 2: 36083052403 650 3: 36091062602 335 4: 36067013000 777 5: 36083052304 663
解决方案We've created new (and more comprehensive) HTML vignettes for some of the data.table concepts. Have a look here for the other vignettes that we are working on. I'm working on vignettes for joins, which when done will hopefully clarify these type of problems better.
The idea is to first
setkey()
onDT1
on the columntract
.setkey(DT1, tract)
In data.tables, a join of the form
x[i]
requires key forx
, but not necessarily fori
. This results in two scenarios:
If
i
also has key set -- the first key column ofi
is matched against first key column ofx
, second against second and so on..If
i
doesn't have key set -- the first column ofi
is matched against the first key column ofx
, second column ofi
against second key column ofx
and so on..
In this case, since your first column in i
is also tract
, we'll skip setting key on i
.
Then, we perform a join of the form x[i]
. By doing this, for each i
the matching row indices in x
is computed, and then the join result is materialised. However, we don't want the entire join result as a new data.table. Rather, we want to update DT1
's CreditScore
column with DT2
's on those matching rows..
In data.tables, we can perform that operation while joining, by providing the expression in j
, as follows:
DT1[DT2, CreditScore := i.CreditScore]
# tract CreditScore
# 1: 36067013000 777
# 2: 36083052304 663
# 3: 36083052403 650
# 4: 36091062602 335
# 5: 36107020401 635
DT1[DT2
part finds the matching rows in DT1
for each row in DT2
. And if there's a match, we want DT2
's value to be updated in DT1
. We accomplish that by using i.CreditScore
-- it refers to DT2
's CreditScore
column (i.
is a prefix used to distinguish columns with identical names between x
and i
data.tables).
Update: As pointed out under comments, the solution above would also update the non-NA values in DT1
. Therefore the way to do it would be:
DT1[is.na(CreditScore), CreditScore := DT2[.(.SD), CreditScore]]
On those rows where CreditScore
from DT1
is NA
, replace CreditScore
from DT1
with the values from CreditScore
obtained from the join of DT2[.(.SD)]
, where .SD
corresponds to the subset of data.table that contains all the rows where CreditScore
is NA
.
HTH
这篇关于使用另一个数据表中列的值更新一个数据表中的一列NA的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!