Aggregate by consecutive values and group
Problem description
In the following data set I have filtered JSON intervals down to the instances where the bike count is zero. station_summary_id represents one time interval and increases by consecutive integers (in the example, 64129 is associated with "2014-10-01 07:00:00", 64130 with "2014-10-01 07:10:00", and so on). station_id is the unique id of a station.
My objective is to find the longest chain of consecutive integers per station_id, in other words, the longest time period during which each station was empty. I understand that this requires grouping by station_id first and then finding the longest consecutive sequence in station_summary_id, but I am not sure how to automate this for all station ids.
> dim(data)
[1] 307039      7
> head(data)
      station_id status available_bike_count          created_at station_summary_id month year
13694          2 Active                    0 2014-10-01 07:00:00              64129    10 2014
13702         10 Active                    0 2014-10-01 07:00:00              64129    10 2014
13706         14 Active                    0 2014-10-01 07:00:00              64129    10 2014
13710         18 Active                    0 2014-10-01 07:00:00              64129    10 2014
13713         21 Active                    0 2014-10-01 07:00:00              64129    10 2014
13728         36 Active                    0 2014-10-01 07:00:00              64129    10 2014
Reproducible example:
> dput(dat) structure(list(station_id = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), status = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Active", class = "factor"), available_bike_count = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), station_summary_id = c(64129L, 64130L, 64131L, 64132L, 64133L, 64134L, 64136L, 64138L, 64139L, 64140L, 64141L, 64142L, 64143L, 64144L, 64145L, 64146L, 64147L, 64148L, 64149L, 64150L, 64152L, 64161L, 64162L, 64170L, 64273L, 64322L, 64324L, 64341L, 64884L, 64886L, 64896L, 64897L, 64898L, 64899L, 64900L, 64901L, 64902L, 64903L, 64904L, 64905L, 64906L, 64907L, 64908L, 64909L, 64910L, 64911L, 64912L, 64913L, 64917L, 64918L, 65214L, 65219L, 66314L, 66439L, 66450L, 66583L, 66587L, 66589L, 66600L, 66872L, 66880L, 67037L, 67048L, 82854L, 82855L, 82856L, 82857L, 82858L, 82859L, 82860L, 82861L, 82862L, 82863L, 82867L, 82868L), month = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L), year = c(2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L)), .Names = c("station_id", "status", "available_bike_count", "station_summary_id", "month", "year"), row.names = c(NA, -75L), class = "data.frame")
See ?rle for a better understanding of the possible uses of run-length encoding.
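As a small illustration of what rle returns when it is applied to the output of diff, here is a sketch on a made-up vector. The ids vector and the variable names are assumptions for illustration only and are not taken from dat.

```r
# Toy vector of interval ids (made up for illustration, not taken from dat)
ids <- c(10L, 11L, 12L, 14L, 15L, 16L, 17L, 30L)

steps <- diff(ids)   # gaps between neighbouring ids: 1 1 2 1 1 1 13
runs  <- rle(steps)  # run-length encoding of those gaps

runs$values          # 1 2 1 13  (the gap that each run repeats)
runs$lengths         # 2 1 3 1   (how many times each gap repeats)

# Longest run of equal gaps: 3 gaps of size 1, i.e. the ids 14, 15, 16, 17
longest <- max(runs$lengths)
longest              # 3
```

Note that the result counts gaps between ids, so a run of 3 gaps of size 1 corresponds to 4 consecutive ids.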
With your new data:
> max( rle( diff(dat$station_summary_id) )$lengths )
[1] 12
With multiple station_id's in the revised example, I found that aggregate worked fairly well:
aggregate( dat$station_summary_id, dat['station_id'], FUN= function(d) max( rle( diff(d) )$lengths ) )
#---------
  station_id  x
1          2 12
2          3 17
3          4  9
This also succeeds with data.table syntax:
library(data.table)
setDT(dat)  # converts dat to a data.table by reference
dat[, max( rle( diff(station_summary_id) )$lengths ) , by='station_id']
#-----
   station_id V1
1:          2 12
2:          3 17
3:          4  9
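One caveat: max(rle(diff(x))$lengths) reports the longest run of equal gaps, whatever the gap size, and it counts gaps rather than ids. If only gaps of exactly 1 should count, and the result should be the number of intervals in the chain, a stricter variant might look like the sketch below. The helper longest_chain and the toy data frame are hypothetical names introduced here and are not part of the original answer.

```r
# Hypothetical helper (not from the original answer): length of the longest
# chain of consecutive integers in ids, counted in ids rather than in gaps.
longest_chain <- function(ids) {
  ids <- sort(unique(ids))
  if (length(ids) < 2L) return(length(ids))
  runs <- rle(diff(ids))
  ones <- runs$lengths[runs$values == 1]  # keep only runs where the gap is 1
  if (length(ones) == 0L) return(1L)      # no two ids are consecutive
  max(ones) + 1L                          # n gaps of 1 link n + 1 ids
}

# Made-up data: station 3 has a long run of gaps of 10, which should not count
toy <- data.frame(
  station_id         = c(2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
  station_summary_id = c(1L, 2L, 3L, 7L, 10L, 20L, 30L, 40L, 41L)
)

res <- aggregate(station_summary_id ~ station_id, data = toy, FUN = longest_chain)
res$station_summary_id  # 3 2
```

On the toy data, the unfiltered max(rle(diff(d))$lengths) would give 3 for station 3 because of the three gaps of 10, while the filtered version gives a chain of 2 ids (40, 41).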