Update values in data.table with values from another data.table

Problem description

I have a dataset with around 25 million rows. I take a subset of these rows and apply a function to it, which works fine. However, I then need to update the values in the original dataset with the new values while leaving the rest unchanged. I am sure this is straightforward, but I just can't get my head around it.

This is a simplified version of what I am dealing with:

require("data.table")

df <-data.frame(AREA_CD = c(sample(1:25000000, 25000000, replace=FALSE)), ALLOCATED = 0, ASSIGNED = "A", ID_CD = c(1:25000000))
df$ID_CD <- interaction( "ID", df$ID_CD, sep = "")
dt <- as.data.table(df)

sub_dt <- dt[5:2004,]
sub_dt[,ALLOCATED:=ALLOCATED+1]
sub_dt[,ASSIGNED:="B"]

What I am after is for the values in 'ALLOCATED' and 'ASSIGNED' from sub_dt to replace the 'ALLOCATED' and 'ASSIGNED' values in dt, matched on the 'ID_CD' column. Based on my example, the output would still have 25 million rows, 2,000 of which would be updated. Any help would be much appreciated. Thanks.

Solution

The answer provided by David Arenburg in his comment explains how to join the subset of modified data back into the original data.table.
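
For reference, the join-based approach can be sketched on a toy table as follows. This is only a minimal illustration, assuming a five-row stand-in for the 25-million-row table; the column names follow the question, while the values and row positions are made up for the example.

```r
library(data.table)

# Toy stand-in for the large table (values are invented for illustration)
dt <- data.table(
  AREA_CD   = c(11L, 22L, 33L, 44L, 55L),
  ALLOCATED = 0,
  ASSIGNED  = "A",
  ID_CD     = paste0("ID", 1:5)
)

# Subsetting rows creates a copy, so editing sub_dt leaves dt untouched
sub_dt <- dt[2:3]
sub_dt[, ALLOCATED := ALLOCATED + 1]
sub_dt[, ASSIGNED := "B"]

# Update join: for rows of dt that match sub_dt on ID_CD, overwrite both
# columns with the values from sub_dt (the i. prefix refers to sub_dt)
dt[sub_dt, `:=`(ALLOCATED = i.ALLOCATED, ASSIGNED = i.ASSIGNED), on = "ID_CD"]
```

After the join, dt still has all five rows, and only the rows whose ID_CD appears in sub_dt carry the new values.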

However, I wonder why the OP doesn't apply the changes directly in the original data.table by reference using a function which returns a list:

my_fun <- function(alloc, assig) {
  list(
    alloc + 1,
    "B")
}

With this function the subset of rows can be updated directly within the data.table:

dt[5:2004, c("ALLOCATED", "ASSIGNED") := my_fun(ALLOCATED, ASSIGNED)]
dt[1:7]
#   AREA_CD ALLOCATED ASSIGNED ID_CD
#1:    1944         0        A   ID1
#2:    3265         0        A   ID2
#3:   15415         0        A   ID3
#4:   14121         0        A   ID4
#5:   10546         1        B   ID5
#6:    2263         1        B   ID6
#7:   12339         1        B   ID7
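
If a separate helper function is not needed, the same kind of in-place update can probably be written inline. The sketch below assumes a small six-row toy table rather than the data from the question, and the second statement (selecting rows by ID_CD value instead of by position) is an added variation, not something taken from the original answer.

```r
library(data.table)

# Small toy table (invented values) with the same column names as above
dt <- data.table(
  ALLOCATED = rep(0, 6),
  ASSIGNED  = rep("A", 6),
  ID_CD     = paste0("ID", 1:6)
)

# In-place update of rows 3 and 4 using the functional form of :=
dt[3:4, `:=`(ALLOCATED = ALLOCATED + 1, ASSIGNED = "B")]

# Rows can also be selected by value rather than by position
dt[ID_CD %in% c("ID5", "ID6"),
   c("ALLOCATED", "ASSIGNED") := list(ALLOCATED + 1, "B")]
```

Both statements modify dt by reference, so no assignment back to dt is required.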


Benchmark

Due to memory limitations, only a smaller data set with 2.5 million rows (instead of the 25 million in the OP) is used.

library(microbenchmark)
setDT(df)  # coerce df to data.table
microbenchmark(
  copy = dt <- copy(df),
  join = {
    dt <- copy(df)
    sub_dt <- dt[5:2004,]
    sub_dt[,ALLOCATED:=ALLOCATED+1]
    sub_dt[,ASSIGNED:="B"]
    dt[sub_dt, `:=`(ALLOCATED = i.ALLOCATED, ASSIGNED = i.ASSIGNED), on = .(ID_CD)]
  },
  byref = {
    dt <- copy(df)
    dt[5:2004, c("ALLOCATED", "ASSIGNED") := my_fun(ALLOCATED, ASSIGNED)]
  },
  times = 10L
)
#Unit: milliseconds
#  expr       min        lq      mean    median        uq       max neval
#  copy  13.80400  14.07850  28.22882  14.15836  14.39643 154.70570    10
#  join 239.36476 240.72745 244.27668 243.52967 246.17104 255.06271    10
# byref  14.28806  14.47308  15.00056  14.63147  14.73134  18.71181    10

Updating the data.table "in place" is much faster than creating a subset and joining it back later. The copy operation is required so that every benchmark run starts with an unmodified version of dt. The copy operation is therefore benchmarked on its own as well, and the byref timing turns out to be barely longer than the copy alone.
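
The need for copy() in the benchmark follows from data.table's reference semantics: a plain assignment with <- only creates a second name for the same table, so an update with := would leak into every later run. A minimal sketch, with hypothetical object names chosen for illustration:

```r
library(data.table)

original <- data.table(x = 1:3)

alias    <- original        # second name for the same object, no data copied
snapshot <- copy(original)  # independent deep copy

# Update by reference
original[, x := x * 10L]

alias$x     # changed together with original
snapshot$x  # still holds the values from before the update
```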

data.table version 1.10.4 was used.
