将列间隔聚合到data.table中的新列中 [英] Aggregate column intervals into new columns in data.table

查看:300
本文介绍了将列间隔聚合到data.table中的新列中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想根据列的间隔(时间)聚合 data.table 。这里的想法是每个区间应该是一个单独的列,在输出中有不同的名称。



我看到一个类似的问题在SO ,但我不能得到我的头的问题。帮助?



可重现的示例



  b 
$ b#sample data
set.seed(1L)
dt < - data.table(id = sample(LETTERS,50,replace = TRUE),
time = sample(60,50,replace = TRUE),
points = sample(1000,50,replace = TRUE))

简单摘要`id`
dt [ ,。(total = sum(points)),by = id]
> id total
> 1:J 2058
> 2:T 1427
> 3:C 1020

所需输出在它们起源的间隔大小之后。例如,具有三个间隔,例如 10 20 30 ,输出的头应该是:

  id |总| subtotal_under10 | subtotal_under20 | subtotal_under30 


解决方案

独家小计类别



  set.seed(1L); 
N <-50L;
dt< - data.table(id = sample(LETTERS,N,T),time = sample(60L,N,T),points = sample

break< - seq(0L,as.integer(ceiling((max(dt $ time)+ 1L)/ 10)* 10)
cut< - cut(dt $ time,breaks,labels = paste0('subtotal_under',break [-1L]),right = F);
res <-dcast(dt [,。(subtotal = sum(points)),。(id,cut = cuts)],id_cut,value.var ='subtotal'
res< - res [dt [,。(total = sum(points)),id]] [order(id)];
res;






  #id subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60 total 
## 1:A NA NA 176 NA NA 512 688
## 2:B NA NA 599 NA NA NA 599
## 3:C 527 NA NA NA NA NA 527
## 4:D NA NA 174 NA NA NA 174
## 5:E NA 732 643 NA NA NA 1375
## 6:F 634 NA NA NA NA 1473 2107
## 7:G NA NA 1410 NA NA NA 1410
## 8:I NA NA NA NA NA 596 596
## 9:J 447 NA 640 NA NA 354 1441
## 10:K 508 NA NA NA NA 454 962
## 11:M NA 14 1358 NA NA NA 1372
## 12:N NA NA NA NA 730 NA 730
## 13:O NA NA 271 NA NA 259 530
## 14:P NA NA NA NA 78 NA 78
## 15:Q 602 NA 485 NA 925 NA 2012
## 16:R NA 599 357 479 NA NA 1435
## 17:S NA 986 716 865 NA NA 2567
## 18:T NA NA NA NA 105 NA 105
## 19:U NA NA NA 239 1163 641 2043
## 20:V NA 683 NA NA 929 NA 1612
## 21:W NA NA NA NA 229 NA 229
# #22:X 214 993 NA NA NA NA 1207
## 23:Y NA 130 992 NA NA NA 1122
## 24:Z NA NA NA NA 104 NA 104
## id subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60 total






累计小计类别



我想出了一个基于累积小计要求的新解决方案。



我的目标是以避免循环操作,例如 lapply(),因为我意识到应该可以只使用向量化操作计算所需的结果,例如 findInterval ),矢量化/累积运算,如 cumsum()和向量索引。

我成功了,但是我应该警告你,算法是相当复杂的,在它的逻辑。

  break<  -  seq(0L,as.integer(ceiling dt $ time)+ 1L)/ 10)* 10),10L); 
ints< - findInterval(dt $ time,breaks);
res< - dt [,{y< - ints [.I]; o - (y); y <-y [o]; w = - 其中(c [y [ - 长度(y)] = y [-1L],T) v <-rep(c(NA,w),diff(c(1L,y [w],length(breaks) c(sum(points),as.list(cumsum(points [o])[v])); },id] [order(id)];
setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks [-1L])));
res;
## id total subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60
## 1:A 688 NA NA 176 176 176 688
## 2:B 599 NA NA 599 599 599 599
## 3:C 527 527 527 527 527 527 527
## 4:D 174 NA NA 174 174 174 174
## 5:E 1375 NA 732 1375 1375 1375 1375
## 6:F 2107 634 634 634 634 634 2107
## 7:G 1410 NA NA 1410 1410 1410 1410
## 8:I 596 NA NA NA NA NA 596
## 9: J 1441 447 447 1087 1087 1087 1441
## 10:K 962 508 508 508 508 508 962
## 11:M 1372 NA 14 1372 1372 1372 1372
## 12:N 730 NA NA NA NA 730 730
## 13:O 530 NA NA 271 271 271 530
## 14:P 78 NA NA NA NA 78 78
## 15:Q 2012 602 602 1087 1087 2012 2012
## 16:R 1435 NA 599 956 1435 1435 1435
## 17:S 2567 NA 986 1702 2567 2567 2567
## 18:T 105 NA NA NA NA 105 105
## 19:U 2043 NA NA NA 239 1402 2043
## 20:V 1612 NA 683 683 683 1612 1612
## 21:W 229 NA NA NA NA 229 229
## 22:X 1207 214 1207 1207 1207 1207 1207
## 23:Y 1122 NA 130 1122 1122 1122 1122
## 24:Z 104 NA NA NA 104 104
## id total subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60



说明



  break <-seq(0L,as.integer(ceiling((max(dt $ time)+ 1L)/ 10)* 10) 
break< - seq(0,ceiling(max(dt $ time)/ 10)* 10,10); ## old派生,参考

首先,我们派生 c $ c>如前。我应该提到,我意识到在我的原始派生算法中有一个微妙的错误。也就是说,如果最大的时间值是10的倍数,那么派生的 break 考虑我们是否有最大的时间值为60.原始计算序列的上限将是 ceiling(60 / 10)* 10 ,这只是60了。但是它应该是70,因为值60在技术上属于 60 <= time < 70 interval。我通过在计算序列的上限时向最大的时间增加1来将此修改为新代码(并追溯修改旧代码)。我还将两个字面量改为整数,并添加了 as.integer()强制以保留整数。






  ints < -  findInterval(dt $ time,breaks); 

其次,我们预先计算每个时间值下降。我们可以为整个表预先计算一次,因为我们可以在 j id c>后续data.table索引操作的参数。注意, findInterval()使用默认参数为我们的目的表现完美;我们不需要混淆 rightmost.closed all.inside 左.open 。这是因为 findInterval()默认使用 lower< = value<上逻辑,并且值不可能低于最低断点(为零)或高于最高断点(必须大于最大时间 value,因为我们导出它的方式)。






  res<  -  dt [,{y<  -  ints [.I] o  - (y)。 y <-y [o]; w =  - 其中(c [y [ - 长度(y)] = y [-1L],T) v <-rep(c(NA,w),diff(c(1L,y [w] c(sum(points),as.list(cumsum(points [o])[v])); },id] [order(id)]; 

第三,我们使用data.table索引操作计算聚合,按 id 。 (之后我们使用链接索引操作按 id 排序,但这不重要。) j 参数包括6

  y < -  ints [.I]; 

这会提取当前 id

  o < -  order(y); 

这将按时间间隔捕获组的记录顺序。对于 points 的累积求和,以及在该累积和中的哪些索引的推导代表期望的间隔小计,我们将需要这个顺序。注意,区间内顺序(即关系)是不相关的,因为我们只提取每个区间的最终小计,无论是否以及如何 order()打破关系。

  y <-y [o] 

这实际上将 y 重新排序为间隔顺序。

  w < -  which(c(y [-length(y)]!= y [ ); 

这计算每个间隔序列的端点包括间隔的最终元素的那些元素。此向量将始终包含至少一个索引,它将不会包含比有间隔的更多索引,并且它将是唯一的。

   

这会重复 w 其距离(以间隔测量)与其下面的元素。我们在 y [w] 上使用 diff()来计算这些距离,需要一个附加的 length(breaks)元素来正确处理 w 的最后一个元素。我们还需要覆盖如果第一个间隔(和零个或多个后续间隔)不在组中表示,在这种情况下,我们必须用NA填充它。这需要在 w 前添加一个NA,并在参数向量前面加上 diff()

  c(sum(points),as.list(cumsum(points [o])[v])); 

最后,我们可以计算组聚合结果。因为您需要一个总列,然后单独的小计列,我们需要一个列表,从总聚合开始,后面是每个小计值一个列表组件。 points [o] 给我们以间隔顺序给出目标求和操作数,然后累加求和,然后用 v 以产生正确的累积小计的顺序。我们必须使用 as.list()将向量强制转换为列表,然后使用总聚合前缀列表,这只是整个 points 向量。然后从 j 表达式返回结果列表。






  setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks [-1L]))); 

最后,我们设置列名。与在 j 表达式中反复设置它们相反,将它们设置为一次之后是更有效的。






基准化



对于基准测试,我将我的代码包装在一个函数中,并对Mike的代码执行相同操作。我决定让我的 break 变量作为一个参数,其派生作为默认参数,我也对Mike的 my_nums 变量,但没有默认参数。



另请注意,对于 identical() ,我将两个结果强制为矩阵,因为Mike的代码总是计算总和小计列为双精度,而我的代码保留输入 points 列的类型(即整数if它是整数,如果是双倍,double)。






 

code> library(data.table);
library(microbenchmark);

bgoldst < - function(dt,breaks = seq(0L,as.integer(ceiling((max(dt $ time)+ 1L)/ 10)* 10) < - findInterval(dt $ time,breaks); res< - dt [,{y< - ints [.I] o - (y)。 y <-y [o]; w = - 其中(c [y [ - 长度(y)] = y [-1L],T) v <-rep(c(NA,w),diff(c(1L,y [w],length(breaks) c(sum(points),as.list(cumsum(points [o])[v])); },id] [order(id)]; setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks [-1L]))); res; };
mike< - function(dt,my_nums){cols< - sapply(1:length(my_nums),function(x){return(paste0(subtotal_under,my_nums [x])) dt [,(cols):= lapply(my_nums,function(x)ifelse(time< x,points,NA))] dt [,total:= points]; dt [,lapply(.SD,function(x){if(all(is.na(x))){as.numeric(NA)} else {as.numeric(sum(x,na.rm = TRUE)) }}),by = id,.SDcols = c(total,cols)] [order(id)]; };






  #OP's sample input 
set.seed(1L);
N <-50L;
dt< - data.table(id = sample(LETTERS,N,T),time = sample(60L,N,T),points = sample

相同(as.matrix(bgoldst(copy(dt))),asmatrix(mike(copy(dt),c(10,20,30,40,50,60))) );
## [1] TRUE

microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60) );
##单位:毫秒
## expr min lq mean median uq max neval
## bgoldst(copy(dt))3.281380 3.484301 3.793532 3.588221 3.780023 6.322846 100
## mike (copy(dt),c(10,20,30,40,50,60))3.243746 3.442819 3.731326 3.526425 3.702832 5.618502 100







<$>

p $ p> ##大输入1
set.seed(1L);
N <-1e5L;
dt< - data.table(id = sample(LETTERS,N,T),time = sample(60L,N,T),points = sample

相同(as.matrix(bgoldst(copy(dt))),asmatrix(mike(copy(dt),c(10,20,30,40,50,60,70) )));
## [1] TRUE

microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60,70 )));
##单位:毫秒
## expr min lq mean median uq max neval
## bgoldst(copy(dt))19.44409 19.96711 22.26597 20.36012 21.26289 62.37914 100
## mike (copy(dt),c(10,20,30,40,50,60,70))94.35002 96.50347 101.06882 97.71544 100.07052 146.65323 100

对于这么大的输入,我的代码明显优于迈克的。



如果你想知道为什么我必须添加70 Mike的 my_nums 参数,这是因为有了这么多的记录,在 dt $ time 极高,这需要额外的间隔。您可以看到 identical()调用给出了TRUE,因此这是正确的。





b $ b

  ##大输入2 
set.seed(1L);
N <-1e6L;
dt< - data.table(id = sample(LETTERS,N,T),time = sample(60L,N,T),points = sample(1000L,

相同(as.matrix(bgoldst(copy(dt))),asmatrix(mike(copy(dt),c(10,20,30,40,50,60,70) )));
## [1] TRUE

microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60,70 )));
##单位:毫秒
## expr min lq mean median uq max neval
## bgoldst(copy(dt))204.8841 207.2305 225.0254 210.6545 249.5497 312.0077 100
## mike (copy(dt),c(10,20,30,40,50,60,70))1039.4480 1086.3435 1125.8285 1116.2700 1158.4772 1412.6840 100

对于这个更大的输入,性能差异稍显显着。


I would like to aggregate a data.table based on intervals of a column (time). The idea here is that each interval should be a separate column with a different name in the output.

I've seen a similar question in SO but I couldn't get my head around the problem. help?

reproducible example

library(data.table)

# sample data
  set.seed(1L)
  dt <- data.table( id= sample(LETTERS,50,replace=TRUE),
                    time= sample(60,50,replace=TRUE),
                    points= sample(1000,50,replace=TRUE))

# simple summary by `id`
   dt[, .(total = sum(points)), by=id]
>     id total
> 1:  J  2058
> 2:  T  1427
> 3:  C  1020

In the desired output, each column would be named after the interval size they originate from. For example with three intervals, say time < 10, time < 20, time < 30, the head of the output should be:

  id | total | subtotal_under10 | subtotal_under20 | subtotal_under30

解决方案

Exclusive Subtotal Categories

set.seed(1L);
N <- 50L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));

breaks <- seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L);
cuts <- cut(dt$time,breaks,labels=paste0('subtotal_under',breaks[-1L]),right=F);
res <- dcast(dt[,.(subtotal=sum(points)),.(id,cut=cuts)],id~cut,value.var='subtotal');
res <- res[dt[,.(total=sum(points)),id]][order(id)];
res;


##     id subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60 total
##  1:  A               NA               NA              176               NA               NA              512   688
##  2:  B               NA               NA              599               NA               NA               NA   599
##  3:  C              527               NA               NA               NA               NA               NA   527
##  4:  D               NA               NA              174               NA               NA               NA   174
##  5:  E               NA              732              643               NA               NA               NA  1375
##  6:  F              634               NA               NA               NA               NA             1473  2107
##  7:  G               NA               NA             1410               NA               NA               NA  1410
##  8:  I               NA               NA               NA               NA               NA              596   596
##  9:  J              447               NA              640               NA               NA              354  1441
## 10:  K              508               NA               NA               NA               NA              454   962
## 11:  M               NA               14             1358               NA               NA               NA  1372
## 12:  N               NA               NA               NA               NA              730               NA   730
## 13:  O               NA               NA              271               NA               NA              259   530
## 14:  P               NA               NA               NA               NA               78               NA    78
## 15:  Q              602               NA              485               NA              925               NA  2012
## 16:  R               NA              599              357              479               NA               NA  1435
## 17:  S               NA              986              716              865               NA               NA  2567
## 18:  T               NA               NA               NA               NA              105               NA   105
## 19:  U               NA               NA               NA              239             1163              641  2043
## 20:  V               NA              683               NA               NA              929               NA  1612
## 21:  W               NA               NA               NA               NA              229               NA   229
## 22:  X              214              993               NA               NA               NA               NA  1207
## 23:  Y               NA              130              992               NA               NA               NA  1122
## 24:  Z               NA               NA               NA               NA              104               NA   104
##     id subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60 total


Cumulative Subtotal Categories

I've come up with a new solution based on the requirement of cumulative subtotals.

My objective was to avoid looping operations such as lapply(), since I realized that it should be possible to compute the desired result using only vectorized operations such as findInterval(), vectorized/cumulative operations such as cumsum(), and vector indexing.

I succeeded, but I should warn you that the algorithm is fairly intricate, in terms of its logic. I'll try to explain it below.

breaks <- seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L);
ints <- findInterval(dt$time,breaks);
res <- dt[,{ y <- ints[.I]; o <- order(y); y <- y[o]; w <- which(c(y[-length(y)]!=y[-1L],T)); v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks)))); c(sum(points),as.list(cumsum(points[o])[v])); },id][order(id)];
setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks[-1L])));
res;
##     id total subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60
##  1:  A   688               NA               NA              176              176              176              688
##  2:  B   599               NA               NA              599              599              599              599
##  3:  C   527              527              527              527              527              527              527
##  4:  D   174               NA               NA              174              174              174              174
##  5:  E  1375               NA              732             1375             1375             1375             1375
##  6:  F  2107              634              634              634              634              634             2107
##  7:  G  1410               NA               NA             1410             1410             1410             1410
##  8:  I   596               NA               NA               NA               NA               NA              596
##  9:  J  1441              447              447             1087             1087             1087             1441
## 10:  K   962              508              508              508              508              508              962
## 11:  M  1372               NA               14             1372             1372             1372             1372
## 12:  N   730               NA               NA               NA               NA              730              730
## 13:  O   530               NA               NA              271              271              271              530
## 14:  P    78               NA               NA               NA               NA               78               78
## 15:  Q  2012              602              602             1087             1087             2012             2012
## 16:  R  1435               NA              599              956             1435             1435             1435
## 17:  S  2567               NA              986             1702             2567             2567             2567
## 18:  T   105               NA               NA               NA               NA              105              105
## 19:  U  2043               NA               NA               NA              239             1402             2043
## 20:  V  1612               NA              683              683              683             1612             1612
## 21:  W   229               NA               NA               NA               NA              229              229
## 22:  X  1207              214             1207             1207             1207             1207             1207
## 23:  Y  1122               NA              130             1122             1122             1122             1122
## 24:  Z   104               NA               NA               NA               NA              104              104
##     id total subtotal_under10 subtotal_under20 subtotal_under30 subtotal_under40 subtotal_under50 subtotal_under60

Explanation

breaks <- seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L);
breaks <- seq(0,ceiling(max(dt$time)/10)*10,10); ## old derivation, for reference

First, we derive breaks as before. I should mention that I realized there was a subtle bug in my original derivation algorithm. Namely, if the maximum time value is a multiple of 10, then the derived breaks vector would've been short by 1. Consider if we had a maximum time value of 60. The original calculation of the upper limit of the sequence would've been ceiling(60/10)*10, which is just 60 again. But it should be 70, since the value 60 technically belongs in the 60 <= time < 70 interval. I fixed this in the new code (and retroactively amended the old code) by adding 1 to the maximum time value when computing the upper limit of the sequence. I also changed two of the literals to integers and added an as.integer() coercion to preserve integerness.


ints <- findInterval(dt$time,breaks);

Second, we precompute the interval indexes into which each time value falls. We can precompute this once for the entire table, because we'll be able to index out each id group's subset within the j argument of the subsequent data.table indexing operation. Note that findInterval() behaves perfectly for our purposes using the default arguments; we don't need to mess with rightmost.closed, all.inside, or left.open. This is because findInterval() by default uses lower <= value < upper logic, and it's impossible for values to fall below the lowest break (which is zero) or on or above the highest break (which must be greater than the maximum time value because of the way we derived it).


res <- dt[,{ y <- ints[.I]; o <- order(y); y <- y[o]; w <- which(c(y[-length(y)]!=y[-1L],T)); v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks)))); c(sum(points),as.list(cumsum(points[o])[v])); },id][order(id)];

Third, we compute the aggregation using a data.table indexing operation, grouping by id. (Afterward we sort by id using a chained indexing operation, but that's not significant.) The j argument consists of 6 statements executed in a braced block which I will now explain one at a time.

y <- ints[.I];

This pulls out the interval indexes for the current id group in input order.

o <- order(y);

This captures the order of the group's records by interval. We will need this order for the cumulative summation of points, as well as the derivation of which indexes in that cumulative sum represent the desired interval subtotals. Note that the within-interval orders (i.e. ties) are irrelevant, since we're only going to extract the final subtotals of each interval, which will be the same regardless if and how order() breaks ties.

y <- y[o];

This actually reorders y to interval order.

w <- which(c(y[-length(y)]!=y[-1L],T));

This computes the endpoints of each interval sequence, IOW the indexes of only those elements that comprise the final element of an interval. This vector will always contain at least one index, it will never contain more indexes than there are intervals, and it will be unique.

v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks))));

This repeats each element of w according to its distance (as measured in intervals) from its following element. We use diff() on y[w] to compute these distances, requiring an appended length(breaks) element to properly treat the final element of w. We also need to cover if the first interval (and zero or more subsequent intervals) is not represented in the group, in which case we must pad it with NAs. This requires prepending an NA to w and prepending a 1 to the argument vector to diff().

c(sum(points),as.list(cumsum(points[o])[v]));

Finally, we can compute the group aggregation result. Since you want a total column and then separate subtotal columns, we need a list starting with the total aggregation, followed by one list component per subtotal value. points[o] gives us the target summation operand in interval order, which we then cumulatively sum, and then index with v to produce the correct sequence of cumulative subtotals. We must coerce the vector to a list using as.list(), and then prepend the list with the total aggregation, which is simply the sum of the entire points vector. The resulting list is then returned from the j expression.


setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks[-1L])));

Last, we set the column names. It is more performant to set them once after-the-fact, as opposed to having them set repeatedly in the j expression.


Benchmarking

For benchmarking, I wrapped my code in a function, and did the same for Mike's code. I decided to make my breaks variable a parameter with its derivation as the default argument, and I did the same for Mike's my_nums variable, but without a default argument.

Also note that for the identical() proofs-of-equivalence, I coerce the two results to matrix, because Mike's code always computes the total and subtotal columns as doubles, whereas my code preserves the type of the input points column (i.e. integer if it was integer, double if it was double). Coercing to matrix was the easiest way I could think of to verify that the actual data is equivalent.


library(data.table);
library(microbenchmark);

bgoldst <- function(dt,breaks=seq(0L,as.integer(ceiling((max(dt$time)+1L)/10)*10),10L)) { ints <- findInterval(dt$time,breaks); res <- dt[,{ y <- ints[.I]; o <- order(y); y <- y[o]; w <- which(c(y[-length(y)]!=y[-1L],T)); v <- rep(c(NA,w),diff(c(1L,y[w],length(breaks)))); c(sum(points),as.list(cumsum(points[o])[v])); },id][order(id)]; setnames(res,2:ncol(res),c('total',paste0('subtotal_under',breaks[-1L]))); res; };
mike <- function(dt,my_nums) { cols <- sapply(1:length(my_nums),function(x){return(paste0("subtotal_under",my_nums[x]))}); dt[,(cols) := lapply(my_nums,function(x) ifelse(time<x,points,NA))]; dt[,total := points]; dt[,lapply(.SD,function(x){ if (all(is.na(x))){ as.numeric(NA) } else{ as.numeric(sum(x,na.rm=TRUE)) } }),by=id, .SDcols=c("total",cols) ][order(id)]; };


## OP's sample input
set.seed(1L);
N <- 50L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));

identical(as.matrix(bgoldst(copy(dt))),as.matrix(mike(copy(dt),c(10,20,30,40,50,60))));
## [1] TRUE

microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60)));
## Unit: milliseconds
##                                       expr      min       lq     mean   median       uq      max neval
##                          bgoldst(copy(dt)) 3.281380 3.484301 3.793532 3.588221 3.780023 6.322846   100
##  mike(copy(dt), c(10, 20, 30, 40, 50, 60)) 3.243746 3.442819 3.731326 3.526425 3.702832 5.618502   100

Mike's code is actually faster (usually) by a small amount for the OP's sample input.


## large input 1
set.seed(1L);
N <- 1e5L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));

identical(as.matrix(bgoldst(copy(dt))),as.matrix(mike(copy(dt),c(10,20,30,40,50,60,70))));
## [1] TRUE

microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60,70)));
## Unit: milliseconds
##                                           expr      min       lq      mean   median        uq       max neval
##                              bgoldst(copy(dt)) 19.44409 19.96711  22.26597 20.36012  21.26289  62.37914   100
##  mike(copy(dt), c(10, 20, 30, 40, 50, 60, 70)) 94.35002 96.50347 101.06882 97.71544 100.07052 146.65323   100

For this much larger input, my code significantly outperforms Mike's.

In case you're wondering why I had to add the 70 to Mike's my_nums argument, it's because with so many more records, the probability of getting a 60 in the random generation of dt$time is extremely high, which requires the additional interval. You can see that the identical() call gives TRUE, so this is correct.


## large input 2
set.seed(1L);
N <- 1e6L;
dt <- data.table(id=sample(LETTERS,N,T),time=sample(60L,N,T),points=sample(1000L,N,T));

identical(as.matrix(bgoldst(copy(dt))),as.matrix(mike(copy(dt),c(10,20,30,40,50,60,70))));
## [1] TRUE

microbenchmark(bgoldst(copy(dt)),mike(copy(dt),c(10,20,30,40,50,60,70)));
## Unit: milliseconds
##                                           expr       min        lq      mean    median        uq       max neval
##                              bgoldst(copy(dt))  204.8841  207.2305  225.0254  210.6545  249.5497  312.0077   100
##  mike(copy(dt), c(10, 20, 30, 40, 50, 60, 70)) 1039.4480 1086.3435 1125.8285 1116.2700 1158.4772 1412.6840   100

For this even larger input, the performance difference is slightly more pronounced.

这篇关于将列间隔聚合到data.table中的新列中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆