Using a rolling time interval to count rows in R and dplyr


Problem description

Let's say I have a dataframe of timestamps with the corresponding number of tickets sold at that time.

         Timestamp          ticket_count
            (time)              (int)
1  2016-01-01 05:30:00            1
2  2016-01-01 05:32:00            1
3  2016-01-01 05:38:00            1
4  2016-01-01 05:46:00            1
5  2016-01-01 05:47:00            1
6  2016-01-01 06:07:00            1
7  2016-01-01 06:13:00            2
8  2016-01-01 06:21:00            1
9  2016-01-01 06:22:00            1
10 2016-01-01 06:25:00            1

I want to know how to calculate the number of tickets sold within a certain time frame of each ticket. For example, for every row I want the number of tickets sold up to 15 minutes after that row's timestamp. In this case, the first row would have three tickets, the second row would have four tickets, etc.

Ideally, I'm looking for a dplyr solution, as I want to do this for multiple stores with a group_by() function. However, I'm having a little trouble figuring out how to hold each Timestamp fixed for a given row while simultaneously searching through all Timestamps via dplyr syntax.

Solution

In the current development version of data.table, v1.9.7, non-equi joins are implemented. Assuming your data.frame is called df and the Timestamp column is POSIXct type:

require(data.table) # v1.9.7+
window = 15L # minutes
(counts = setDT(df)[.(t=Timestamp+window*60L), on=.(Timestamp<t), 
                     .(counts=sum(ticket_count)), by=.EACHI]$counts)
#  [1]  3  4  5  5  5  9 11 11 11 11

# add that as a column to original data.table by reference
df[, counts := counts]

For each row in t, all rows where df$Timestamp < t are fetched (there is no lower bound, so the count accumulates from the first row). by=.EACHI then makes the expression sum(ticket_count) run once for each row in t, which gives your desired result.
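To make the join condition concrete, here is a minimal base R sketch of the same computation without data.table. It rebuilds the example data by hand; the names `window_secs`, `secs`, `counts_upto` and `counts_window` are illustrative and not part of the original answer. The second vector is a hypothetical variant that adds a lower bound, in case a strictly rolling 15-minute window is wanted instead of a running total.

```r
# Rebuild the example data from the question (base R only).
df <- data.frame(
  Timestamp = as.POSIXct(
    c("2016-01-01 05:30:00", "2016-01-01 05:32:00", "2016-01-01 05:38:00",
      "2016-01-01 05:46:00", "2016-01-01 05:47:00", "2016-01-01 06:07:00",
      "2016-01-01 06:13:00", "2016-01-01 06:21:00", "2016-01-01 06:22:00",
      "2016-01-01 06:25:00"),
    tz = "UTC"),
  ticket_count = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L)
)

window_secs <- 15 * 60             # 15 minutes, expressed in seconds
secs <- as.numeric(df$Timestamp)   # timestamps as seconds since the epoch

# Same condition as the join: every row with Timestamp < t, no lower bound.
counts_upto <- vapply(
  secs,
  function(s) sum(df$ticket_count[secs < s + window_secs]),
  numeric(1)
)

# Assumed variant: only rows inside [Timestamp, Timestamp + 15 min).
counts_window <- vapply(
  secs,
  function(s) sum(df$ticket_count[secs >= s & secs < s + window_secs]),
  numeric(1)
)

print(counts_upto)    # 3 4 5 5 5 9 11 11 11 11, matching the output above
print(counts_window)
```

This loops over every row and scans the whole column each time, so it is quadratic in the number of rows; that is fine for small tables, while the non-equi join is likely the better choice for large ones.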

Hope this helps.
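Since the question asks for a dplyr approach across multiple stores, the same condition can be sketched inside group_by() and mutate(). This is only an illustration under assumptions: the `store` column, the `sales` data and the helper `count_upto()` are made up for the example, and the helper simply scans each group once per row rather than using a join.

```r
library(dplyr)

# Hypothetical two-store data set; the `store` column is an assumption.
base_time <- as.POSIXct("2016-01-01 05:30:00", tz = "UTC")
sales <- data.frame(
  store        = c("A", "A", "A", "B", "B"),
  Timestamp    = base_time + 60 * c(0, 2, 20, 0, 10),
  ticket_count = c(1L, 1L, 2L, 3L, 1L)
)

window_secs <- 15 * 60   # 15 minutes, expressed in seconds

# Hypothetical helper: for each timestamp, sum the tickets in all rows
# whose timestamp is strictly earlier than that timestamp plus the window.
count_upto <- function(ts, n, window_secs) {
  secs <- as.numeric(ts)
  vapply(secs, function(s) sum(n[secs < s + window_secs]), numeric(1))
}

result <- sales %>%
  group_by(store) %>%                 # the helper sees one store at a time
  mutate(counts = count_upto(Timestamp, ticket_count, window_secs)) %>%
  ungroup()

print(result)
```

Because mutate() evaluates the helper separately for each group, tickets from one store never count toward another store's totals.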
