用另一个数据表中一列的值更新一个数据表中的一列 NA [英] Update a column of NAs in one data table with the value from a column in another data table

查看:22
本文介绍了用另一个数据表中一列的值更新一个数据表中的一列 NA的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在这个主题上看到过类似的帖子,但无法理解解决方案.本质上,我有一个包含分数和一些 NA 的数据表 (DT1).无论哪里有NA",我都希望有一个过程引用另一个数据 (DT2) 表,该表在两个表 (tract) 中都有一个公共列,然后用 DT2 的分数列的值更新 DT1 中的 NA.我选择数据表类是因为我认为这是一个有效的选择.

I have seen a similar post on this topic, but could not comprehend the solution. Essentially I have a data table (DT1) with scores and some NAs. Wherever there is an 'NA' I would like a process that references another data (DT2) table that has a common column in both tables (tract) and then update the NA in DT1 with the value of DT2's score column. I am choosing a data table class because I think it is a an efficient option.

DT1

         tract CreditScore    
1: 36107020401         635  
2: 36083052403          NA  
3: 36091062602          NA  
4: 36067013000          NA  
5: 36083052304          NA  

DT2

         tract CreditScore  
1: 36107020401         635  
2: 36083052403         650  
3: 36091062602         335  
4: 36067013000         777  
5: 36083052304         663  

推荐答案

我们创建了新的(更全面的)HTML 小插图 以了解一些 data.table 概念.查看此处,了解我们正在处理的其他小插曲.我正在制作用于加入的小插图,完成后有望更好地阐明这些类型的问题.

We've created new (and more comprehensive) HTML vignettes for some of the data.table concepts. Have a look here for the other vignettes that we are working on. I'm working on vignettes for joins, which when done will hopefully clarify these type of problems better.

这个想法是在 tract 列上的 DT1 上首先 setkey().

The idea is to first setkey() on DT1 on the column tract.

setkey(DT1, tract)

在 data.tables 中,x[i] 形式的连接需要 x 的键,但不需要 i.这会导致两种情况:

In data.tables, a join of the form x[i] requires key for x, but not necessarily for i. This results in two scenarios:

  • 如果i也有key set——i的第一个key列与x的第一个key列匹配,第二对第二等等..

  • If i also has key set -- the first key column of i is matched against first key column of x, second against second and so on..

如果 i 没有设置键 -- i 的第一列与第一列 key 匹配xi 的第二列对 x 的第二个 key 列,依此类推..

If i doesn't have key set -- the first column of i is matched against the first key column of x, second column of i against second key column of x and so on..

在这种情况下,由于您在 i 中的第一列也是 tract,我们将跳过 i 上的设置键.

In this case, since your first column in i is also tract, we'll skip setting key on i.

然后,我们执行 x[i] 形式的连接.通过这样做,对于每个 i,计算 x 中匹配的行索引,然后实现连接结果.但是,我们不希望整个连接结果作为新的 data.table.相反,我们想用那些匹配行上的 DT2 更新 DT1CreditScore 列..

Then, we perform a join of the form x[i]. By doing this, for each i the matching row indices in x is computed, and then the join result is materialised. However, we don't want the entire join result as a new data.table. Rather, we want to update DT1's CreditScore column with DT2's on those matching rows..

在data.tables中,我们可以在加入时执行该操作,通过在j中提供表达式,如下:

In data.tables, we can perform that operation while joining, by providing the expression in j, as follows:

DT1[DT2, CreditScore := i.CreditScore]
#          tract CreditScore
# 1: 36067013000         777
# 2: 36083052304         663
# 3: 36083052403         650
# 4: 36091062602         335
# 5: 36107020401         635

DT1[DT2 部分为 DT2 中的每一行查找 DT1 中匹配的行.如果匹配,我们希望 DT2 的值在 DT1 中更新.我们通过使用 i.CreditScore 来实现——它引用了 DT2CreditScore 列(i. 是用于区分 xi data.tables 之间具有相同名称的列的前缀.

DT1[DT2 part finds the matching rows in DT1 for each row in DT2. And if there's a match, we want DT2's value to be updated in DT1. We accomplish that by using i.CreditScore -- it refers to DT2's CreditScore column (i. is a prefix used to distinguish columns with identical names between x and i data.tables).

更新:正如评论中所指出的,上述解决方案还将更新 DT1 中的非 NA 值.因此,这样做的方法是:

Update: As pointed out under comments, the solution above would also update the non-NA values in DT1. Therefore the way to do it would be:

DT1[is.na(CreditScore), CreditScore := DT2[.(.SD), CreditScore]]

DT1 中的 CreditScoreNA 的那些行上,替换 DT1 中的 CreditScorecode> 和 CreditScore 的值从 DT2[.(.SD)] 的连接中获得,其中 .SD 对应于data.table 包含 CreditScoreNA 的所有行.

On those rows where CreditScore from DT1 is NA, replace CreditScore from DT1 with the values from CreditScore obtained from the join of DT2[.(.SD)], where .SD corresponds to the subset of data.table that contains all the rows where CreditScore is NA.

HTH

这篇关于用另一个数据表中一列的值更新一个数据表中的一列 NA的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆