How to add flexible delta columns using dplyr?


Problem description



I would like to use dplyr to add a "delta" column to a dataset. The delta would be computed as the difference between the current row's value and the value from a previous row. The challenge is that the immediately preceding row is not necessarily the right one, because some filtering is needed.

Consider this dataset:

LEVEL, TIME
3,     0000
2,     0010
2,     0020
1,     0030
2,     0040
3,     0050

I want to add a new column, DELTA, containing the difference between the TIME value and the previous TIME value from a row with the same LEVEL or greater. That is, instead of comparing with the immediately preceding row, I would like to search backwards and skip over any rows with a lower LEVEL.

For this example the expected output would be:

LEVEL, TIME, DELTA
3,     0000, NA
2,     0010, 10
2,     0020, 10
1,     0030, 10
2,     0040, 20
3,     0050, 50

Can this be done straightforwardly with dplyr? (Or otherwise?)

I would like an efficient solution because my real dataset is approximately one billion rows and has seven timestamp columns (but only one level.)

(Background: The data is from a software application log file using many time sources available from the CPU, e.g. cycles, instructions, and L1/L2/L3/DRAM access counters. I want to measure the elapsed time between events. The messages with lower levels are not separate preceding events but rather finer-grained details.)

EDIT WITH NEW INFORMATION:

None of the solutions I have tried with dplyr actually work with my million-element data set. They seem to be slow and to blow up the R process.

I have fallen back to learning some base R and writing a reasonably practical (~2 seconds for a 1M-row data frame) implementation like this:

level <- c(3,2,2,1,2,3,6,4,7,8,2) # recycled to 1M elements, below
time <- seq(0, 10000000, 10)

# reference timestamp accumulator for update inside closure.
# index is log level and value is reference timestamp for delta.
ref <- numeric(9)
f <- function(level, time) {
  delta <- time - ref[level]
  ref[1:level] <<- time
  delta
}

delta <- mapply(f, level, time)
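The accumulator logic above is language-agnostic, so here is the same single pass sketched in Python purely for illustration (not the author's code; None stands in for NA). One detail worth noting: the R version initializes ref with zeros via numeric(9), so its first delta comes out as 0 rather than NA, whereas this sketch tracks "no reference yet" explicitly:

```python
# Single pass: ref[k] holds the most recent TIME seen at any level >= k.
levels = [3, 2, 2, 1, 2, 3]
times = [0, 10, 20, 30, 40, 50]

ref = [None] * 10          # index = level; None means "no reference yet"
deltas = []
for level, t in zip(levels, times):
    # Delta against the last row whose level was >= this row's level.
    deltas.append(None if ref[level] is None else t - ref[level])
    # This row becomes the reference for every level at or below it.
    for k in range(1, level + 1):
        ref[k] = t

print(deltas)  # [None, 10, 10, 10, 20, 50]
```

This matches the expected DELTA column from the example, and it runs in O(n · max_level) time, which explains why the accumulator formulation scales to large inputs where a self-join does not.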

Is this reasonable? Is there a comparable dplyr solution?

I am basically satisfied. I do feel like this should be ~10x faster; ~5000 CPU cycles per vector element seems a bit insane, but it works for me, and is perhaps reasonable in the context of an interpreter that is copying the ref accumulator on each step.

EDIT2: On reflection the performance of this formulation is a bit of a drag. I would like a 10x speedup if possible!

Solution

I join the data.frame to itself, then select all rows that meet your criteria, and then select the closest matching row. To get the same number of rows in the result (NA in the first row), I join the base data.frame again (right_join).

LEVEL <- c(3,2,2,1,2,3)
TIME <- c('0000','0010','0020','0030','0040','0050')

df <- data.frame(LEVEL, TIME, stringsAsFactors = F)

df %>%  
  merge(df, by = NULL, all = T) %>%                     # Cartesian self-join
  filter(LEVEL.y >= LEVEL.x & TIME.x > TIME.y) %>%      # earlier rows with same or higher level
  group_by(TIME.x, LEVEL.x) %>% 
  filter(row_number(desc(TIME.y)) == 1) %>%             # keep the most recent qualifying row
  mutate(delta = as.numeric(TIME.x) - as.numeric(TIME.y)) %>%
  rename(LEVEL = LEVEL.x, TIME = TIME.x) %>%  
  select(TIME, LEVEL, delta) %>%
  right_join(df)                                        # restore unmatched rows (NA delta)

Another approach would be to calculate the min(delta) for every group instead of ordering and selecting the first row. I prefer the solution above, because you can then use the other information from the matching row as well.

df %>% merge(df, by = NULL, all=T) %>%  
  filter(LEVEL.y >= LEVEL.x & TIME.x > TIME.y) %>%
  group_by(TIME.x, LEVEL.x) %>%  
  summarise(delta = min(as.numeric(TIME.x) - as.numeric(TIME.y))) %>%
  rename(LEVEL = LEVEL.x, TIME=TIME.x) %>%  
  select(TIME, LEVEL, delta) %>%
  right_join(df)
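The self-join pipelines above are equivalent to a direct quadratic scan: for each row, take the minimum time gap to any earlier row with the same or higher level. A pure-Python sketch of that semantics (illustrative only, not R) reproduces the expected output, and also makes the O(n²) cost visible, which is why this approach struggles at millions of rows:

```python
levels = [3, 2, 2, 1, 2, 3]
times = [0, 10, 20, 30, 40, 50]

deltas = []
for i, (lvl, t) in enumerate(zip(levels, times)):
    # Candidate reference rows: strictly earlier, with level >= current level.
    gaps = [t - times[j] for j in range(i) if levels[j] >= lvl]
    deltas.append(min(gaps) if gaps else None)

print(deltas)  # [None, 10, 10, 10, 20, 50]
```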
