data.table replace data using values from another data.table, conditionally

Question

This is similar to "Update values in data.table with values from another data.table" and "R data.table replacing an index of values from another data.table", except that in my situation the number of variables is very large, so I do not want to list them explicitly.

What I have is a large data.table (let's call it dt_original) and a smaller data.table (let's call it dt_newdata) whose IDs are a subset of those in the first and which has only some of the first table's variables. I would like to update the values in dt_original with the values from dt_newdata. For an added twist, I only want to update the values conditionally: in this case, only if the values in dt_newdata are larger than the corresponding values in dt_original.

For a reproducible example, here are the data. In the real world the tables are much larger:

library(data.table)
set.seed(0)

## This data.table with 20 rows and many variables is the existing data set
dt_original <- data.table(id = 1:20)
setkey(dt_original, id)

for(i in 2015:2017) {
  varA <- paste0('varA_', i)
  varB <- paste0('varB_', i)
  varC <- paste0('varC_', i)

  dt_original[, (varA) := rnorm(20)]
  dt_original[, (varB) := rnorm(20)]
  dt_original[, (varC) := rnorm(20)]
}

## This table with a strict subset of IDs from dt_original and only a part of
## the variables is our potential replacement data
dt_newdata <- data.table(id = sample(1:20, 3))
setkey(dt_newdata, id)

newdata_vars <- sample(names(dt_original)[-1], 4)

for(var in newdata_vars) {
  dt_newdata[, (var) := rnorm(3)]
}

Here is a way of doing it using a loop and pmax, but there has to be a better way, right?

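for(var in newdata_vars) {
  ## Rows of dt_original whose id appears in dt_newdata; both tables are keyed
  ## (sorted) on id, so these rows line up with the rows of dt_newdata
  rows <- dt_original$id %in% dt_newdata$id
  k <- pmax(dt_newdata[[var]], dt_original[[var]][rows])
  dt_original[rows, (var) := k]
}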

It seems like there should be a way using join syntax, and maybe the prefix i. and/or .SD or something like that, but nothing I've tried comes close enough to warrant repeating here.

Recommended answer

This code should work with the data in its current format, given your criteria.

dt_original[dt_newdata, names(dt_newdata) := Map(pmax, mget(names(dt_newdata)), dt_newdata)]

It joins on the IDs that match between the two data.tables (both are keyed on id) and then performs the assignment with :=. Because the right-hand side of := should be a list of columns, I use Map to run pmax over pairs of columns: the columns of dt_original, fetched by name with mget, and the columns of dt_newdata. Note that all column names of dt_newdata must also be present in dt_original.
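To see the mechanics on a single column, here is a minimal sketch with toy tables (orig, upd and the column x are made up for illustration and are not part of the question's data). It assumes that, inside an update join, a bare column name refers to the table being updated and the i. prefix refers to the table being joined in:

```r
library(data.table)

# Hypothetical toy tables, both keyed on id
orig <- data.table(id = 1:4, x = c(1, 5, 3, 7), key = "id")
upd  <- data.table(id = c(2L, 4L), x = c(9, 2), key = "id")

# Update join: only rows of orig whose id appears in upd are touched.
# x is orig's value for the matched row, i.x is upd's value.
orig[upd, on = "id", x := pmax(x, i.x)]

orig$x  # 1 9 3 7: id 2 was raised from 5 to 9, id 4 kept 7 because 2 is smaller
```

The Map version above does the same thing for every shared column at once instead of naming x explicitly.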

Following Frank's comment, you can drop the first element from both the Map inputs and the column names using [-1], because that element is the ID column, which does not need to be computed. Dropping it avoids one pass of pmax and also preserves the key on id. Thanks to @brian-stamper for pointing out the key preservation in the comments.

dt_original[dt_newdata,
            names(dt_newdata)[-1] := Map(pmax,
                                         mget(names(dt_newdata)[-1]),
                                         dt_newdata[, .SD, .SDcols=-1])]
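Since := modifies dt_original by reference, it can be worth checking the result against a copy taken beforehand. The following is only a sketch on small made-up tables (orig, upd and the columns a and b are hypothetical), applying the same pattern as above and asserting that values never decrease and that untouched columns, unmatched rows and the key are unchanged:

```r
library(data.table)
set.seed(1)

# Hypothetical stand-ins for dt_original and dt_newdata
orig <- data.table(id = 1:6, a = rnorm(6), b = rnorm(6), key = "id")
upd  <- data.table(id = c(2L, 5L), a = rnorm(2), key = "id")

before <- copy(orig)  # snapshot, because := updates orig in place

# Same pattern as the answer: skip the id column on both sides
orig[upd,
     names(upd)[-1] := Map(pmax,
                           mget(names(upd)[-1]),
                           upd[, .SD, .SDcols = -1])]

stopifnot(
  all(orig$a >= before$a),                 # values only ever increase
  identical(orig$b, before$b),             # column absent from upd is untouched
  identical(orig[!upd, on = "id", a],      # rows with no match are untouched
            before[!upd, on = "id", a]),
  identical(key(orig), "id")               # key on id is preserved
)
```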

Note that the use of [-1] assumes that the ID variable is in the first position of dt_newdata. If it is elsewhere, you could change the index manually or use grep to locate it.
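One way to avoid depending on column position at all is to select the value columns by name. This is a sketch rather than part of the original answer: it uses setdiff instead of grep, the toy tables orig and upd are hypothetical, and it relies on mget being able to fetch the i.-prefixed columns inside the join:

```r
library(data.table)

# Hypothetical toy tables; the id column of upd is deliberately not first
orig <- data.table(id = 1:4, a = c(1, 5, 3, 7), b = c(0, 0, 0, 0), key = "id")
upd  <- data.table(a = c(9, 2), id = c(2L, 4L), b = c(-1, 1))

# Every column of upd except the ID, regardless of where the ID sits
vars <- setdiff(names(upd), "id")

# mget(vars) fetches orig's columns, mget(paste0("i.", vars)) fetches upd's
orig[upd, on = "id",
     (vars) := Map(pmax, mget(vars), mget(paste0("i.", vars)))]

orig$a  # 1 9 3 7
orig$b  # 0 0 0 1
```

Because the id column is never assigned to, the key on orig is kept here as well.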
