数据表中的条件差计算 [英] Conditional difference calculation in data table

查看:98
本文介绍了数据表中的条件差计算的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个百万行长的data.table约20个反式样列。这些列显示各种存储系统操作的计数器在时间上增加。有时候,计数器在被监视的系统上被复位,单个观测值低于上一个。



我需要计算一个 opsdiff 列,其将包含基于列 type 的相同类型的后续值的算术差(预期为正)。当识别计数器复位情况时 - 即。差值为负,应使用计数器的实际值。

 > dt & dt 
type opscounter
1:a 105609
2:a 106316
3:a 106705
4:a 489
5:a 723
6:a 1250
7:b 105609
8:b 106316
9:b 106705
10:b 489
11:b 723
12 :b 1250

我想得到的结果:


$ b b

 > dt 
type opscounter opsdiff
1:a 105609 NA
2:a 106316 707
3:a 106705 389
4:a 489 489
5: a 723 234
6:a 1250 527
7:b 105609 NA
8:b 106316 707
9:b 106705 389
10:b 489 489
11:b 723 234
12:b 1250 527
>

我试图构建一个基于 ifelse $ c>,但我不知道如何解决当前行的 opscounter else 部分中的值,也不知道 diff()可能是一个有效的用法:

  dt [,opsdiff:= c NA,ifelse(diff(opscounter)> 0,diff(opscounter),opscounter)),by = type] 


$ b b

如何正确计算 opsdiff 列?



问候,



Lukas

解决方案

首选解决方案是:

  dt [,opsdiff:= c(NA,diff(opscounter)),
by = type] [opsdiff< 0,opsdiff:= opscounter] []
#type opscounter opsdiff
#1:a 105609 NA
#2:a 106316 707
#3:a 106705 389
#4:a 489 489
#5:a 723 234
#6:a 1250 527
#7:b 105609 NA
#8:b 106316 707
#9:b 106705 389
#10:b 489 489
#11:b 723 234
#12:b 1250 527
pre>


请注意,为了打印结果,我已添加了 [] 并且为了说明您可以添加多个这些。


一般来说,最好避免 ifelse (特别是在您使用这种大数据集的情况下)为可以缓慢(虽然向量化),因为它评估了情况。在你的情况下,你发现另一个缺陷,你需要告诉它确切的位置,你想从 opscounter ,这增加了复杂性(见 @可以覆盖的Arwr注释)。



关于你的问题,在 DT [...] 操作的 data.table $ c>只是调用 [。data.table(DT,...)。在 data.frame 上没有任何区别;有一个类似的函数 [。data.frame


请注意, data.table 也是 data.frame 。参见 class(dt)并且还读取?data.table


为了使它更清晰,在data.table中,tacks [...] 一个接一个称为链接。这是免费的。您也可以在data.frame中执行相同操作(如下所示),但是您可以对 data.frame 执行的操作是有限的,因此使用与数据表不同。

  df < -  as.data.frame(dt)#或`setDF )`在1.9.4+中通过引用做这个
df [df $ type ==a,] [2:3,]
#type opscounter
#2 a 106316
#3 a 106705






为了说明 ifelse 的无效性,这里是一个基准:

  set.seed(123)
n< - 1e6
dt< - data.table(type = rep(c(a,b),each = n),
opscounter = sample(1:1e5,n * 2,replace = TRUE))


库(microbenchmark)
microbenchmark(
dt [,opsdiff:= c (NA,diff(opscounter)),by = type] [opsdiff <0,opsdiff:= opscounter],
dt [,opsdiff:= c(NA,ifelse (opscounter),tail(opscounter,-1L))),by = type]


#单位:毫秒
#expr
# :=`(opsdiff,c(NA,diff(opscounter)))by = type] [opsdiff < 0,`:=`(opsdiff,opscounter)]
#dt [,`:=`(opsdiff,c(NA,ifelse(diff(opscounter) -1L))),by = type]
#min lq mean median uq max neval
#228.0445 255.4006 285.8163 281.1388 307.4195 508.3841 100
#899.1222 990.1478 1085.5492 1048.3704 1095.7179 1740.5704 100

ifelse 解决方案的速度慢了〜4 。


I have a millions row long data.table with about 20 counter-style columns. Those columns display various storage system operations' counters increasing in time. Sometimes however, the counter gets reset on the monitored system and the single observation value is lower than a previous one.

I need to calculate a opsdiff column, which would contain an arithmetical difference (expected to be positive) of subsequent values of same type based on column type. When a counter reset situation is identified - ie. the difference is negative, the actual value of the counter should be used.

> dt <- data.table(type=rep(c("a","b"),each=6),opscounter=c(105609,106316,106705,489,723,1250))
> dt
    type opscounter
 1:    a     105609
 2:    a     106316
 3:    a     106705
 4:    a        489
 5:    a        723
 6:    a       1250
 7:    b     105609
 8:    b     106316
 9:    b     106705
10:    b        489
11:    b        723
12:    b       1250

The result I wish to get:

> dt
    type opscounter     opsdiff
 1:    a     105609          NA
 2:    a     106316         707
 3:    a     106705         389
 4:    a        489         489
 5:    a        723         234
 6:    a       1250         527
 7:    b     105609          NA
 8:    b     106316         707
 9:    b     106705         389
10:    b        489         489
11:    b        723         234
12:    b       1250         527
> 

I tried to build a construction based on ifelse() but I don't know how to address the current row's opscounter value in else part and neither the dual call of diff() is perhaps an efficient usage:

dt[,opsdiff:=c(NA, ifelse(diff(opscounter)>0, diff(opscounter), opscounter)), by=type]

How can I correctly calculate the opsdiff column?

Greetings,

Lukas

解决方案

The preferred solution would be:

dt[, opsdiff := c(NA, diff(opscounter)), 
   by = type][opsdiff < 0, opsdiff := opscounter][]
#     type opscounter opsdiff
#  1:    a     105609      NA
#  2:    a     106316     707
#  3:    a     106705     389
#  4:    a        489     489
#  5:    a        723     234
#  6:    a       1250     527
#  7:    b     105609      NA
#  8:    b     106316     707
#  9:    b     106705     389
# 10:    b        489     489
# 11:    b        723     234
# 12:    b       1250     527

Note that I've added additional [] in order to print results on the fly and in order to illustrate that you can add more than one of these.

In general, it would be better to avoid ifelse (especially in your case with such big data set) as it can to be slow (although vectorized) due to it evaluating both yes and no cases. In your case, you found another "flaw", that you need to tell it the exact locations where you want to pull opscounter from, which adds to the complexity (see @Aruns comment for possible override).

Regarding your question in comments, a data.table operation of the form DT[...] just calls the function [.data.table(DT, ...). It's not any different on a data.frame; there is a similar function [.data.frame.

Note that a data.table is also a data.frame. See class(dt) and also read ?data.table.

To make it even clearer, in data.table, tacking [...] one after another is called chaining. This is something that comes for free. You can do the same in a data.frame as well (as shown below), but the operations you can perform on a data.frame are limited, and therefore the use of chaining itself, unlike data.table.

df <- as.data.frame(dt) # or `setDF(dt)` in 1.9.4+ to do this by reference
df[df$type == "a", ][2:3, ]
#   type opscounter
# 2    a     106316
# 3    a     106705


Finally, in order to illustrate the ineffectiveness of ifelse, here is a benchmark:

set.seed(123)
n <- 1e6
dt <- data.table(type = rep(c("a","b"), each = n), 
                 opscounter = sample(1:1e5, n*2, replace = TRUE))


library(microbenchmark)
microbenchmark(
dt[, opsdiff := c(NA, diff(opscounter)), by = type][opsdiff < 0, opsdiff := opscounter],
dt[, opsdiff := c(NA, ifelse(diff(opscounter) > 0, diff(opscounter), tail(opscounter, -1L))), by=type]
)

# Unit: milliseconds
#                                                                                                         expr
#             dt[, `:=`(opsdiff, c(NA, diff(opscounter))), by = type][opsdiff <  0, `:=`(opsdiff, opscounter)]
# dt[, `:=`(opsdiff, c(NA, ifelse(diff(opscounter) > 0, diff(opscounter), tail(opscounter, -1L)))), by = type]
#      min       lq      mean    median        uq       max neval
# 228.0445 255.4006  285.8163  281.1388  307.4195  508.3841   100
# 899.1222 990.1478 1085.5492 1048.3704 1095.7179 1740.5704   100

The ifelse solution is slower by a factor of ~4.

这篇关于数据表中的条件差计算的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆