数据表中的条件差计算 [英] Conditional difference calculation in data table
问题描述
我有一个百万行长的data.table约20个反式样列。这些列显示各种存储系统操作的计数器在时间上增加。有时候,计数器在被监视的系统上被复位,单个观测值低于上一个。
我需要计算一个 opsdiff
列,其将包含基于列 type
的相同类型的后续值的算术差(预期为正)。当识别计数器复位情况时 - 即。差值为负,应使用计数器的实际值。
> dt & dt
type opscounter
1:a 105609
2:a 106316
3:a 106705
4:a 489
5:a 723
6:a 1250
7:b 105609
8:b 106316
9:b 106705
10:b 489
11:b 723
12 :b 1250
我想得到的结果:
$ b b
> dt
type opscounter opsdiff
1:a 105609 NA
2:a 106316 707
3:a 106705 389
4:a 489 489
5: a 723 234
6:a 1250 527
7:b 105609 NA
8:b 106316 707
9:b 106705 389
10:b 489 489
11:b 723 234
12:b 1250 527
>
我试图构建一个基于 ifelse $ c>,但我不知道如何解决当前行的
opscounter
在 else 部分中的值,也不知道 diff()
可能是一个有效的用法:
dt [,opsdiff:= c NA,ifelse(diff(opscounter)> 0,diff(opscounter),opscounter)),by = type]
$ b b
如何正确计算 opsdiff
列?
问候,
Lukas
首选解决方案是:
dt [,opsdiff:= c(NA,diff(opscounter)),
pre>
by = type] [opsdiff< 0,opsdiff:= opscounter] []
#type opscounter opsdiff
#1:a 105609 NA
#2:a 106316 707
#3:a 106705 389
#4:a 489 489
#5:a 723 234
#6:a 1250 527
#7:b 105609 NA
#8:b 106316 707
#9:b 106705 389
#10:b 489 489
#11:b 723 234
#12:b 1250 527
请注意,为了打印结果,我已添加了
[]
并且为了说明您可以添加多个这些。
一般来说,最好避免
ifelse
(特别是在您使用这种大数据集的情况下)为可以缓慢(虽然向量化),因为它评估了是和无情况。在你的情况下,你发现另一个缺陷,你需要告诉它确切的位置,你想从opscounter
,这增加了复杂性(见 @可以覆盖的Arwr注释)。
关于你的问题,在
DT [...] $ c>操作的
data.table
$ c>只是调用[。data.table(DT,...)
。在data.frame
上没有任何区别;有一个类似的函数[。data.frame
。
请注意,
data.table
也是data.frame
。参见class(dt)
并且还读取?data.table
。
为了使它更清晰,在data.table中,tacks
[...]
一个接一个称为链接。这是免费的。您也可以在data.frame中执行相同操作(如下所示),但是您可以对data.frame
执行的操作是有限的,因此使用与数据表不同。df < - as.data.frame(dt)#或`setDF )`在1.9.4+中通过引用做这个
df [df $ type ==a,] [2:3,]
#type opscounter
#2 a 106316
#3 a 106705
为了说明
ifelse
的无效性,这里是一个基准:set.seed(123)
n< - 1e6
dt< - data.table(type = rep(c(a,b),each = n),
opscounter = sample(1:1e5,n * 2,replace = TRUE))
库(microbenchmark)
microbenchmark(
dt [,opsdiff:= c (NA,diff(opscounter)),by = type] [opsdiff <0,opsdiff:= opscounter],
dt [,opsdiff:= c(NA,ifelse (opscounter),tail(opscounter,-1L))),by = type]
)
#单位:毫秒
#expr
# :=`(opsdiff,c(NA,diff(opscounter)))by = type] [opsdiff < 0,`:=`(opsdiff,opscounter)]
#dt [,`:=`(opsdiff,c(NA,ifelse(diff(opscounter) -1L))),by = type]
#min lq mean median uq max neval
#228.0445 255.4006 285.8163 281.1388 307.4195 508.3841 100
#899.1222 990.1478 1085.5492 1048.3704 1095.7179 1740.5704 100
ifelse
解决方案的速度慢了〜4 。I have a millions row long data.table with about 20 counter-style columns. Those columns display various storage system operations' counters increasing in time. Sometimes however, the counter gets reset on the monitored system and the single observation value is lower than a previous one.
I need to calculate a
opsdiff
column, which would contain an arithmetical difference (expected to be positive) of subsequent values of same type based on columntype
. When a counter reset situation is identified - ie. the difference is negative, the actual value of the counter should be used.> dt <- data.table(type=rep(c("a","b"),each=6),opscounter=c(105609,106316,106705,489,723,1250)) > dt type opscounter 1: a 105609 2: a 106316 3: a 106705 4: a 489 5: a 723 6: a 1250 7: b 105609 8: b 106316 9: b 106705 10: b 489 11: b 723 12: b 1250
The result I wish to get:
> dt type opscounter opsdiff 1: a 105609 NA 2: a 106316 707 3: a 106705 389 4: a 489 489 5: a 723 234 6: a 1250 527 7: b 105609 NA 8: b 106316 707 9: b 106705 389 10: b 489 489 11: b 723 234 12: b 1250 527 >
I tried to build a construction based on
ifelse()
but I don't know how to address the current row'sopscounter
value in else part and neither the dual call ofdiff()
is perhaps an efficient usage:dt[,opsdiff:=c(NA, ifelse(diff(opscounter)>0, diff(opscounter), opscounter)), by=type]
How can I correctly calculate the
opsdiff
column?Greetings,
Lukas
解决方案The preferred solution would be:
dt[, opsdiff := c(NA, diff(opscounter)), by = type][opsdiff < 0, opsdiff := opscounter][] # type opscounter opsdiff # 1: a 105609 NA # 2: a 106316 707 # 3: a 106705 389 # 4: a 489 489 # 5: a 723 234 # 6: a 1250 527 # 7: b 105609 NA # 8: b 106316 707 # 9: b 106705 389 # 10: b 489 489 # 11: b 723 234 # 12: b 1250 527
Note that I've added additional
[]
in order to print results on the fly and in order to illustrate that you can add more than one of these.In general, it would be better to avoid
ifelse
(especially in your case with such big data set) as it can to be slow (although vectorized) due to it evaluating both yes and no cases. In your case, you found another "flaw", that you need to tell it the exact locations where you want to pullopscounter
from, which adds to the complexity (see @Aruns comment for possible override).Regarding your question in comments, a
data.table
operation of the formDT[...]
just calls the function[.data.table(DT, ...)
. It's not any different on adata.frame
; there is a similar function[.data.frame
.Note that a
data.table
is also adata.frame
. Seeclass(dt)
and also read?data.table
.To make it even clearer, in data.table, tacking
[...]
one after another is called chaining. This is something that comes for free. You can do the same in a data.frame as well (as shown below), but the operations you can perform on adata.frame
are limited, and therefore the use of chaining itself, unlike data.table.df <- as.data.frame(dt) # or `setDF(dt)` in 1.9.4+ to do this by reference df[df$type == "a", ][2:3, ] # type opscounter # 2 a 106316 # 3 a 106705
Finally, in order to illustrate the ineffectiveness of
ifelse
, here is a benchmark:set.seed(123) n <- 1e6 dt <- data.table(type = rep(c("a","b"), each = n), opscounter = sample(1:1e5, n*2, replace = TRUE)) library(microbenchmark) microbenchmark( dt[, opsdiff := c(NA, diff(opscounter)), by = type][opsdiff < 0, opsdiff := opscounter], dt[, opsdiff := c(NA, ifelse(diff(opscounter) > 0, diff(opscounter), tail(opscounter, -1L))), by=type] ) # Unit: milliseconds # expr # dt[, `:=`(opsdiff, c(NA, diff(opscounter))), by = type][opsdiff < 0, `:=`(opsdiff, opscounter)] # dt[, `:=`(opsdiff, c(NA, ifelse(diff(opscounter) > 0, diff(opscounter), tail(opscounter, -1L)))), by = type] # min lq mean median uq max neval # 228.0445 255.4006 285.8163 281.1388 307.4195 508.3841 100 # 899.1222 990.1478 1085.5492 1048.3704 1095.7179 1740.5704 100
The
ifelse
solution is slower by a factor of ~4.这篇关于数据表中的条件差计算的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!