data.table: replace data using values from another data.table, conditionally
Question
This is similar to "Update values in data.table with values from another data.table" and "R data.table replacing an index of values from another data.table", except that in my situation the number of variables is very large, so I do not want to list them explicitly.
What I have is a large data.table (let's call it dt_original) and a smaller data.table (let's call it dt_newdata) whose IDs are a subset of the first and which has only some of the variables of the first. I would like to update the values in dt_original with the values from dt_newdata. For an added twist, I only want to update the values conditionally: in this case, only if the values in dt_newdata are larger than the corresponding values in dt_original.
For a reproducible example, here are the data. In the real world the tables are much larger:
library(data.table)
set.seed(0)
## This data.table with 20 rows and many variables is the existing data set
dt_original <- data.table(id = 1:20)
setkey(dt_original, id)
for (i in 2015:2017) {
  varA <- paste0('varA_', i)
  varB <- paste0('varB_', i)
  varC <- paste0('varC_', i)
  dt_original[, (varA) := rnorm(20)]
  dt_original[, (varB) := rnorm(20)]
  dt_original[, (varC) := rnorm(20)]
}
## This table with a strict subset of IDs from dt_original and only a part of
## the variables is our potential replacement data
dt_newdata <- data.table(id = sample(1:20, 3))
setkey(dt_newdata, id)
newdata_vars <- sample(names(dt_original)[-1], 4)
for (var in newdata_vars) {
  dt_newdata[, (var) := rnorm(3)]
}
Here is a way of doing it using a loop and pmax, but there has to be a better way, right?
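for (var in newdata_vars) {
  ## Both tables are keyed by id, so the matching rows line up in the same order
  k <- pmax(dt_newdata[[var]], dt_original[id %in% dt_newdata$id][[var]])
  ## Wrap the column name in parentheses; with = FALSE is deprecated alongside :=
  dt_original[id %in% dt_newdata$id, (var) := k]
}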
It seems like there should be a way using join syntax, and maybe the prefix i. and/or .SD or something like that, but nothing I've tried comes close enough to warrant repeating here.
Recommended answer
This code should work with the data in its current format, given your criteria.
dt_original[dt_newdata, names(dt_newdata) := Map(pmax, mget(names(dt_newdata)), dt_newdata)]
It joins on the IDs that match between the two data.tables and then performs an assignment using :=. Because we want to return a list, I use Map to run pmax over the columns of the two data.tables, matched by the names in dt_newdata. Note that all column names of dt_newdata must also be present in dt_original.
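To make the role of Map and pmax concrete, here is a minimal sketch in base R. The lists old_cols and new_cols are hypothetical stand-ins for the matched columns of each table; they are not taken from the question's data.

```r
# Toy stand-ins for the matched columns of each table (hypothetical values)
old_cols <- list(varA = c(1, 5, 3), varB = c(10, 20, 30))
new_cols <- list(varA = c(2, 4, 3), varB = c(5, 25, 30))

# Map() calls pmax() once per pair of columns and returns a named list,
# which is the shape that := expects on its right-hand side
updated <- Map(pmax, old_cols, new_cols)
str(updated)
```

Each element of the result holds the element-wise maximum of the corresponding old and new column, so values are only replaced where the new value is larger.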
Following Frank's comment, you can use [-1] to drop the first element from both the Map inputs and the column names, because that element is the ID column, which doesn't need to be computed. Removing the first column from Map avoids one pass of pmax and also preserves the key on id. Thanks to @brian-stamper for pointing out the key preservation in the comments.
dt_original[dt_newdata,
names(dt_newdata)[-1] := Map(pmax,
mget(names(dt_newdata)[-1]),
dt_newdata[, .SD, .SDcols=-1])]
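The key preservation can be checked on a small example. This is only a sketch: dt_a, dt_b, and dt_new are hypothetical toy tables, and the join is written with on = "id" rather than relying on keys in both tables.

```r
library(data.table)

# Hypothetical toy tables, not the data from the question
dt_a <- data.table(id = 1:3, v = c(1, 2, 3), key = "id")
dt_b <- copy(dt_a)
dt_new <- data.table(id = 2L, v = 10)

# Assign only to the value column: the key on id is kept
dt_a[dt_new, on = "id", v := pmax(v, i.v)]
key(dt_a)

# Assign to the key column as well: data.table drops the key,
# because it can no longer guarantee the sort order
dt_b[dt_new, on = "id", c("id", "v") := list(i.id, pmax(v, i.v))]
key(dt_b)
```

Calling key() after an update is a quick way to confirm whether a later keyed join will still be able to use the existing key.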
Note that the use of [-1] assumes that the ID variable is in the first position of dt_newdata. If it is elsewhere, you could change the index manually or use grep to find its position by name.
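If the position of the ID column is not guaranteed, one option is to select the value columns by name. The sketch below is a different technique from the join above: it uses match() and data.table's set() in a loop. The tables dt_old and dt_new and their values are hypothetical, and the code assumes that every id in dt_new also occurs in dt_old.

```r
library(data.table)

# Hypothetical toy tables; the ID column is deliberately not first in dt_new
dt_old <- data.table(id = 1:5, a = c(1, 2, 3, 4, 5), b = c(10, 20, 30, 40, 50))
dt_new <- data.table(a = c(9, 0), id = c(2L, 4L), b = c(0, 99))

# Select the value columns by name rather than by position
value_cols <- setdiff(names(dt_new), "id")

# Row positions in dt_old that correspond to each row of dt_new
# (assumes every id in dt_new is present in dt_old, so there are no NAs)
rows <- match(dt_new$id, dt_old$id)

for (col in value_cols) {
  # Keep the larger of the old and new value, updating dt_old by reference
  set(dt_old, i = rows, j = col, value = pmax(dt_old[[col]][rows], dt_new[[col]]))
}
dt_old
```

Because only the value columns are touched, the id column is never assigned to, regardless of where it sits in dt_new.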