How to find the last or next entry using R package data.table and rolling joins


Problem description

Let's say I have a data.table like this.

   customer_id time_stamp value
1:           1        223     4
2:           1        252     1
3:           1        456     3
4:           2        455     5
5:           2        632     2

Here customer_id and time_stamp together form a unique key. I want to add some new columns holding the previous and next values of "value" within each customer. That is, I want output like this.

  customer_id time_stamp value value_PREV value_NEXT
1:           1        223     4         NA          1
2:           1        252     1          4          3
3:           1        456     3          1         NA
4:           2        455     5         NA          2
5:           2        632     2          5         NA

I want this to be fast and to work with sparse, irregular times. I thought that the data.table rolling join would do it for me. However, the rolling join appears to find the last time OR the same time. So if you do a rolling join on two copies of the same table (after adding _PREV to the column names of the copy), every row simply matches itself. You can fudge it by adding a tiny number to the time variable of the copy, but this is rather awkward.
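To make the "last time OR same time" behaviour concrete, here is a minimal sketch on the example data. It assumes the table is keyed on customer_id and time_stamp. The 0.5 offset is an arbitrary choice for illustration and only works because the sample time stamps are whole numbers.

```r
library(data.table)

DT <- data.table(customer_id = c(1, 1, 1, 2, 2),
                 time_stamp  = c(223, 252, 456, 455, 632),
                 value       = c(4, 1, 3, 5, 2))
setkey(DT, customer_id, time_stamp)

# Rolling self-join: an exact match on time_stamp counts as a match,
# so every row finds itself (value from x equals i.value from i)
self_match <- DT[DT, roll = TRUE]

# The "fudge": look up a time slightly before each row's own time.
# The 0.5 offset is an assumption that suits integer-valued time stamps only.
query <- DT[, .(customer_id, time_stamp = time_stamp - 0.5)]
value_prev <- DT[query, roll = TRUE]$value
```

The roll never crosses from one customer_id to another, so the first row of each customer gets NA.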

Is there a way to do this simply with a rolling join or some other data.table method? I've found an efficient way, but it still requires about 40 lines of R code. It seems that this could be a one-liner if the rolling join could be told to look for the last time NOT including the same time. Or maybe there is some other neat trick.

Here is the example data.

data=data.table(customer_id=c(1,2,1,1,2),time_stamp=c(252,632,456,223,455),value=c(1,2,3,4,5))
data_sorted=data[order(customer_id,time_stamp)]


This is the code I wrote. Note that the two lines that put NA into the rows where customer_id differs throw a warning and probably need changing, so I have commented them out below. Does anyone have suggestions for replacing those two lines?

add_prev_next_cbind<-function(data,ident="customer_id",timecol="time_stamp",prev_tag="PREV",
                   next_tag="NEXT",sep="_"){
  o=order(data[[ident]],data[[timecol]])
  uo=order(o)
  data=data[o,]
  Nrow=nrow(data)
  Ncol=ncol(data)
  #shift it, put any junk in the first row
  data_prev=data[c(1,1:(Nrow-1)),]
  #shift it, put any junk in the last row
  data_next=data[c(2:(Nrow),Nrow),]
  #flag the rows where the identity changes, these get NA
  prev_diff=data[[ident]] != data_prev[[ident]]
  prev_diff[1]=T
  next_diff=data[[ident]] != data_next[[ident]]  
  next_diff[Nrow]=T
  #change names
  names=names(data)
  names_prev=paste(names,prev_tag,sep=sep)
  names_next=paste(names,next_tag,sep=sep)
  setnames(data_prev,names,names_prev)
  setnames(data_next,names,names_next)
  #put NA in rows where prev and next are from a different ident
  #replace the next two lines with something else
  #data_prev[prev_diff,]<-NA
  #data_next[next_diff,]<-NA
  data_all=cbind(data,data_prev,data_next)
  data_all=data_all[uo,]
  return(data_all)
}
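One possible replacement for the two commented-out lines is to assign NA by reference with `:=`. The sketch below is an assumption about what would fit, not a tested drop-in for the function above. The small `data_prev` and `prev_diff` objects are hypothetical stand-ins, built by hand to match what the function would compute for the sorted example data.

```r
library(data.table)

# Hypothetical stand-in for the shifted copy built inside the function
data_prev <- data.table(customer_id_PREV = c(1, 1, 1, 1, 2),
                        value_PREV       = c(4, 4, 1, 3, 5))

# TRUE where the previous row belongs to a different customer or does not exist
prev_diff <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Set every column of the flagged rows to NA, modifying data_prev in place
data_prev[prev_diff, (names(data_prev)) := NA]
```

The same pattern would apply to `data_next` with `next_diff`.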

Solution

Update: #965 (https://github.com/Rdatatable/data.table/issues/965) is now implemented in data.table 1.9.5. From NEWS:

  1. New function shift() implements fast lead/lag of vector, list, data.frames or data.tables. It takes a type argument which can be either "lag" (default) or "lead" and always returns a list, which makes it very convenient to use it along with := or set(). For example: DT[, (cols) := shift(.SD, 1L), by=id]. Please have a look at ?shift for more info.
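As an illustration of the `DT[, (cols) := shift(.SD, 1L), by=id]` pattern quoted above, here is a small sketch. The table, the column names `x` and `y`, and the `_lag` suffix are all made up for the example.

```r
library(data.table)

# Hypothetical table with two measurement columns per id
DT <- data.table(id = c("a", "a", "a", "b", "b"),
                 x  = 1:5,
                 y  = c(10, 20, 30, 40, 50))

cols     <- c("x", "y")
lag_cols <- paste0(cols, "_lag")

# shift() on .SD yields one lagged vector per column in .SDcols,
# and := assigns them to the new columns in the same order
DT[, (lag_cols) := shift(.SD, 1L), by = id, .SDcols = cols]
```

Each group starts with NA in the lagged columns because there is no earlier row within that id.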

Now we can therefore do:

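# In current releases shift() returns a plain vector for a single vector input,
# so the two results are combined with list() rather than c()
dt[, c("value_PREV", "value_NEXT") := list(shift(value, 1L, type="lag"),
                                           shift(value, 1L, type="lead")), by=customer_id]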
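For a self-contained check, the sketch below applies shift() to the example data from the question. It assumes the rows must first be sorted by customer and time, because shift() works on row order and knows nothing about time_stamp. The two columns are assigned in separate steps purely for readability.

```r
library(data.table)

dt <- data.table(customer_id = c(1, 2, 1, 1, 2),
                 time_stamp  = c(252, 632, 456, 223, 455),
                 value       = c(1, 2, 3, 4, 5))

# Sort by customer and time so that "previous row" means "previous time"
setorder(dt, customer_id, time_stamp)

dt[, value_PREV := shift(value, 1L, type = "lag"),  by = customer_id]
dt[, value_NEXT := shift(value, 1L, type = "lead"), by = customer_id]
```

After these steps `dt` should hold the value_PREV and value_NEXT columns requested in the question.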


You don't need a rolling join here at all. You can do this with head and tail. Assuming your data.table is DT:

# key on both columns so rows are ordered by time within each customer
setkey(DT, customer_id, time_stamp)
DT[, list(time_stamp = time_stamp, 
          prev.val = c(NA, head(value, -1)), 
          next.val = c(tail(value, -1), NA)), 
by=customer_id]
#   customer_id time_stamp prev.val next.val
# 1:           1        223       NA        1
# 2:           1        252        4        3
# 3:           1        456        1       NA
# 4:           2        455       NA        2
# 5:           2        632        5       NA

Edit: Even better:

DT[, `:=`(prev.val = c(NA, head(value, -1)), 
          next.val = c(tail(value, -1), NA)), 
          by=customer_id]
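The padding idiom used in this answer can be checked on a bare vector, which may help when reasoning about edge cases such as a customer with a single row. This is only an illustrative sketch. The vector names are made up.

```r
v <- c(4, 1, 3)

# Drop the last element and pad the front with NA: previous values
prev_v <- c(NA, head(v, -1))

# Drop the first element and pad the back with NA: next values
next_v <- c(tail(v, -1), NA)

# A group with one row has neither a previous nor a next value
single      <- 7
prev_single <- c(NA, head(single, -1))
next_single <- c(tail(single, -1), NA)
```

Because the result always has the same length as the input, it can be assigned per group with `:=` without recycling problems.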
