使用另一个数据表中列的值更新一个数据表中的一列NA [英] Update a column of NAs in one data table with the value from a column in another data table

查看:204
本文介绍了使用另一个数据表中列的值更新一个数据表中的一列NA的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在这个主题上看到过类似的帖子,但无法理解解决方案。基本上我有一个数据表(DT1)与分数和一些NA。无论哪里有'NA',我想要一个进程引用另一个数据(DT2)表,在两个表(道)中有一个共同的列,然后用DT2的分数列的值更新DT1中的NA。我选择了一个数据表类,因为我认为这是一个高效的选项。



DT1

  tract CreditScore 
1:36107020401 635
2:36083052403 NA
3:36091062602 NA
4:36067013000 NA
5:36083052304 NA

DT2

  tract CreditScore 
1:36107020401 635
2:36083052403 650
3:36091062602 335
4:36067013000 777
5:36083052304 663


解决方案

我们创建了新的://github.com/Rdatatable/data.table/wiki/Getting-startedrel =nofollow> HTML vignettes 中的一些data.table概念。请查看我们正在处理的其他短信的此处。我正在为连接工作的小插曲,当完成时,希望能更好地澄清这些类型的问题。






在列 tract 上的上的 setkey()

  setkey(DT1,tract)

在data.tables中, x [i] 的连接需要 x ,但不一定适用于 i 。这导致两种情况:




  • 如果 i - i 的第一个键列与 x 的第一个键列匹配,第二个与第二个键匹配,依此类推。


  • 如果 i 没有键集 - <$ c $将 x 的第一列 i与 i 的第一个 c> c>

    的第二个


在这种情况下,由于 i 中的第一列也是 tract ,我们将跳过 i 上的设置键。



形式 x [i] 。通过这样做,对于每个 i ,计算 x 中匹配的行索引,然后实现连接结果。但是,我们不希望将整个连接结果作为新的数据表。相反,我们要使用 DT2 DT1 CreditScore



在data.tables中,我们可以在连接时通过提供 j中的表达式来执行该操作,如下所示:

  DT1 [DT2,CreditScore:= i.CreditScore] 
#tract CreditScore
#1:36067013000 777
#2:36083052304 663
#3:36083052403 650
#4:36091062602 335
#5:36107020401 635



DT1 [DT2 code> DT2 中的每一行,code> DT1 。如果有匹配,我们要在 DT1 中更新 DT2 的值。我们通过使用 i.CreditScore - 它指 DT2 CreditScore 列( i。是用于区分 x code> i data.tables)。






更新: / strong>正如在注释中指出的,上面的解决方案还会更新 DT1 中的非NA值。因此,这样做的方式是:

  DT1 [is.na(CreditScore),CreditScore:= DT2 [。 SD),CreditScore]] 

CreditScore DT1 NA ,将 CreditScore DT2 [。(。SD)]的连接获得的 CreditScore ] ,其中 .SD 对应于data.table的子集,其中包含 CreditScore NA



HTH


I have seen a similar post on this topic, but could not comprehend the solution. Essentially I have a data table (DT1) with scores and some NAs. Wherever there is an 'NA' I would like a process that references another data (DT2) table that has a common column in both tables (tract) and then update the NA in DT1 with the value of DT2's score column. I am choosing a data table class because I think it is a an efficient option.

DT1

         tract CreditScore    
1: 36107020401         635  
2: 36083052403          NA  
3: 36091062602          NA  
4: 36067013000          NA  
5: 36083052304          NA  

DT2

         tract CreditScore  
1: 36107020401         635  
2: 36083052403         650  
3: 36091062602         335  
4: 36067013000         777  
5: 36083052304         663  

解决方案

We've created new (and more comprehensive) HTML vignettes for some of the data.table concepts. Have a look here for the other vignettes that we are working on. I'm working on vignettes for joins, which when done will hopefully clarify these type of problems better.


The idea is to first setkey() on DT1 on the column tract.

setkey(DT1, tract)

In data.tables, a join of the form x[i] requires key for x, but not necessarily for i. This results in two scenarios:

  • If i also has key set -- the first key column of i is matched against first key column of x, second against second and so on..

  • If i doesn't have key set -- the first column of i is matched against the first key column of x, second column of i against second key column of x and so on..

In this case, since your first column in i is also tract, we'll skip setting key on i.

Then, we perform a join of the form x[i]. By doing this, for each i the matching row indices in x is computed, and then the join result is materialised. However, we don't want the entire join result as a new data.table. Rather, we want to update DT1's CreditScore column with DT2's on those matching rows..

In data.tables, we can perform that operation while joining, by providing the expression in j, as follows:

DT1[DT2, CreditScore := i.CreditScore]
#          tract CreditScore
# 1: 36067013000         777
# 2: 36083052304         663
# 3: 36083052403         650
# 4: 36091062602         335
# 5: 36107020401         635

DT1[DT2 part finds the matching rows in DT1 for each row in DT2. And if there's a match, we want DT2's value to be updated in DT1. We accomplish that by using i.CreditScore -- it refers to DT2's CreditScore column (i. is a prefix used to distinguish columns with identical names between x and i data.tables).


Update: As pointed out under comments, the solution above would also update the non-NA values in DT1. Therefore the way to do it would be:

DT1[is.na(CreditScore), CreditScore := DT2[.(.SD), CreditScore]]

On those rows where CreditScore from DT1 is NA, replace CreditScore from DT1 with the values from CreditScore obtained from the join of DT2[.(.SD)], where .SD corresponds to the subset of data.table that contains all the rows where CreditScore is NA.

HTH

这篇关于使用另一个数据表中列的值更新一个数据表中的一列NA的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆