dplyr:使用滚动时间窗口对数据进行分组和汇总/变异 [英] dplyr: grouping and summarizing/mutating data with rolling time windows

查看:244
本文介绍了dplyr:使用滚动时间窗口对数据进行分组和汇总/变异的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有不规则的时间序列数据表示用户的某种交易类型。每一行数据都是时间戳的,代表当时的交易。由于数据的不规则性,一些用户一天可能有100行,其他用户可能在一天内有0或1笔交易。



数据可能看起来像这个:

  data.frame(
id = c(1,1,1,1,1,2, 2,3,4),
date = c(2015-01-01,
2015-01-01,
2015-01-05,
2015-01-25,
2015-02-15,
2015-05-05,
2015-01-01,
2015-08-01,
2015-01-01),
n_widgets = c(1,2,3,4,4,5,2,4,5)


id日期n_widgets
1 1 2015-01-01 1
2 1 2015-01-01 2
3 1 2015-01-05 3
4 1 2015-01-25 4
5 1 2015-02-15 4
6 2 2015-05-05 5
7 2 2015-01-01 2
8 3 2015-08-01 4
9 4 2015-01-01 5

我想知道一些滚动的统计数据t用户。例如:对于该用户在某一天,在过去30天内发生了多少交易,在过去30天内出售了多少个小部件等。



对应于上面的例子中,数据应该如下所示:

  id date n_widgets n_trans_30 total_widgets_30 
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015-01-05 3 3 6
4 1 2015-01-25 4 4 10
5 1 2015-02-15 4 2 8
6 2 2015-05-05 5 1 5
7 2 2015-01-01 2 1 2
8 3 2015-08-01 4 1 4
9 4 2015-01-01 5 1 5

如果时间窗口是日那么解决方案很简单: data%>%group_by(id,date)%>%summaryize(...)



同样地,如果时间窗口是月度,这也是相对简单的lubridate: data%>%group_by(id,year(date),month(date))%  %summaryize(...)



然而,我遇到的挑战是如何设置任意时段的时间窗口: 5天,10天等。



还有 RcppRoll 库,但$ RcppRoll ,而 zoo 中的滚动功能似乎更为常规时间序列设置。据我所知,根据行数而不是指定的时间段,我们可以告诉这些窗口函数的工作原理 - 主要区别在于某个时间段可能会有不同的行数,具体取决于日期和用户。



例如,对于用户1来说,有可能在 2015-01-01 之前5天内的交易数量是相当于100笔交易,对于相同的用户,前5天的交易数量 2015-02-01 等于5笔交易。因此,回头一定数量的行将无法正常工作。



此外,还有另一个SO线程讨论了不规则时间序列类型数据的滚动日期(根据滚动日期中存在的条件创建新列表)然而,接受的解决方案是使用 data.table ,我特意寻找一个 dplyr 实现这一点



我认为这个问题的核心,这个问题可以通过回答这个问题解决:我如何 group_by dplyr 中的任意时间段。或者,如果有一个不同的 dplyr 方式实现以上没有复杂的 group_by ,我该怎么办? p>

编辑:更新示例,使滚动窗口的性质更加清晰。

解决方案

p>这可以使用SQL:

 库(sqldf)

dd< - 转换(data,date = as.Date(date))
sqldf(select a。*,count(*)n_trans30,sum(b.n_widgets)'total_widgets30'
from dd a
左连接dd b在a.date-30和a.date
之间的b.date上,b.id = a.id
和b.rowid< = a.rowid
group by a.rowid)

给:

  id日期n_widgets n_trans30 total_widgets30 
1 1 2015-01-01 1 1 1
2 1 2015-01-01 2 2 3
3 1 2015 -01-05 3 3 6
4 1 2015-01-25 4 4 10
5 2 2015-05-05 5 1 5
6 2 2015-01-01 2 1 2
7 3 2015 -08-01 4 1 4
8 4 2015-01-01 5 1 5


I have irregular timeseries data representing a certain type of transaction for users. Each line of data is timestamped and represents a transaction at that time. By the irregular nature of the data some users might have 100 rows in a day and other users might have 0 or 1 transaction in a day.

The data might look something like this:

data.frame(
  id = c(1, 1, 1, 1, 1, 2, 2, 3, 4),
  date = c("2015-01-01", 
           "2015-01-01", 
           "2015-01-05", 
           "2015-01-25",
           "2015-02-15",
           "2015-05-05", 
           "2015-01-01", 
           "2015-08-01", 
           "2015-01-01"),
  n_widgets = c(1,2,3,4,4,5,2,4,5)
)

   id       date n_widgets
1  1 2015-01-01         1
2  1 2015-01-01         2
3  1 2015-01-05         3
4  1 2015-01-25         4
5  1 2015-02-15         4
6  2 2015-05-05         5
7  2 2015-01-01         2
8  3 2015-08-01         4
9  4 2015-01-01         5

Often I'd like to know some rolling statistics about users. For example: for this user on a certain day, how many transactions occurred in the previous 30 days, how many widgets were sold in the previous 30 days etc.

Corresponding to the above example, the data should look like:

   id     date    n_widgets  n_trans_30  total_widgets_30
1  1 2015-01-01         1           1             1
2  1 2015-01-01         2           2             3
3  1 2015-01-05         3           3             6
4  1 2015-01-25         4           4             10
5  1 2015-02-15         4           2             8
6  2 2015-05-05         5           1             5
7  2 2015-01-01         2           1             2
8  3 2015-08-01         4           1             4
9  4 2015-01-01         5           1             5

If the time window is daily then the solution is simple: data %>% group_by(id, date) %>% summarize(...)

Similarly if the time window is monthly this is also relatively simple with lubridate: data %>% group_by(id, year(date), month(date)) %>% summarize(...)

However the challenge I'm having is how to setup a time window for an arbitrary period: 5-days, 10-days etc.

There's also the RcppRoll library but both RcppRoll and the rolling functions in zoo seem more setup for regular time series. As far as I can tell these window functions work based on the number of rows instead of a specified time period -- the key difference is that a certain time period might have a differing number of rows depending on date and user.

For example, it's possible for user 1, that the number of transactions in the 5 days previous of 2015-01-01 is equal to 100 transactions and for the same user the number of transactions in the 5 days previous of 2015-02-01 is equal to 5 transactions. Thus looking back a set number of rows will simply not work.

Additionally, there is another SO thread discussing rolling dates for irregular time series type data (Create new column based on condition that exists within a rolling date) however the accepted solution was using data.table and I'm specifically looking for a dplyr way of achieving this.

I suppose at the heart of this issue, this problem can be solved by answering this question: how can I group_by arbitrary time periods in dplyr. Alternatively, if there's a different dplyr way to achieve above without a complicated group_by, how can I do it?

EDIT: updated example to make nature of the rolling window more clear.

解决方案

This can be done using SQL:

library(sqldf)

dd <- transform(data, date = as.Date(date))
sqldf("select a.*, count(*) n_trans30, sum(b.n_widgets) 'total_widgets30' 
       from dd a 
       left join dd b on b.date between a.date - 30 and a.date 
                         and b.id = a.id
                         and b.rowid <= a.rowid
       group by a.rowid")

giving:

  id       date n_widgets n_trans30 total_widgets30
1  1 2015-01-01         1         1               1
2  1 2015-01-01         2         2               3
3  1 2015-01-05         3         3               6
4  1 2015-01-25         4         4              10
5  2 2015-05-05         5         1               5
6  2 2015-01-01         2         1               2
7  3 2015-08-01         4         1               4
8  4 2015-01-01         5         1               5

这篇关于dplyr:使用滚动时间窗口对数据进行分组和汇总/变异的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆