Aggregate (count) occurrences of values over arbitrary timeframe

Question

I have a CSV file with timestamps and the event types that happened at those times. What I want is to count the number of occurrences of each event type in 6-minute intervals.

The input data looks like:

date,type
"Sep 22, 2011 12:54:53.081240000","2"
"Sep 22, 2011 12:54:53.083493000","2"
"Sep 22, 2011 12:54:53.084025000","2"
"Sep 22, 2011 12:54:53.086493000","2"

I load and clean the data with this piece of code:

> raw_data <- read.csv('input.csv')
> cured_dates <- c(strptime(raw_data$date, '%b %d, %Y %H:%M:%S', tz="CEST"))
> cured_data <- data.frame(cured_dates, c(raw_data$type))
> colnames(cured_data) <- c('date', 'type')

The cleaned data looks like this:

> head(cured_data)
                 date type
1 2011-09-22 14:54:53    2
2 2011-09-22 14:54:53    2
3 2011-09-22 14:54:53    2
4 2011-09-22 14:54:53    2
5 2011-09-22 14:54:53    1
6 2011-09-22 14:54:53    1

I read a lot of samples for xts and zoo, but somehow I can't get the hang of it. The output data should look something like:

date                       type   count
2011-09-22 14:54:00 CEST   1      11
2011-09-22 14:54:00 CEST   2      19
2011-09-22 15:00:00 CEST   1      9
2011-09-22 15:00:00 CEST   2      12
2011-09-22 15:06:00 CEST   1      23
2011-09-22 15:06:00 CEST   2      18

zoo's aggregate function looks promising; I found this code snippet:

# aggregate POSIXct seconds data every 10 minutes
tt <- seq(10, 2000, 10)
x <- zoo(tt, structure(tt, class = c("POSIXt", "POSIXct")))
aggregate(x, time(x) - as.numeric(time(x)) %% 600, mean)

Now I'm just wondering how I could apply this to my use case.

My naive attempt:

> zoo_data <- zoo(cured_data$type, structure(cured_data$time, class = c("POSIXt", "POSIXct")))
> aggr_data = aggregate(zoo_data$type, time(zoo_data$time), - as.numeric(time(zoo_data$time)) %% 360, count)
Error in `$.zoo`(zoo_data, type) : not possible for univariate zoo series

I must admit that I'm not really confident in R, but I'm trying. :-)

I'm kind of lost. Could anyone point me in the right direction?

Thanks a lot! Cheers, Alex.

Here is the output of dput for a small subset of my data. The data itself is around 80 million rows.

structure(list(date = structure(c(1316697885, 1316697885, 1316697885, 
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 
1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 1316697885, 
1316697885, 1316697885), class = c("POSIXct", "POSIXt"), tzone = ""), 
    type = c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 
    1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L)), .Names = c("date", 
"type"), row.names = c(NA, -23L), class = "data.frame")

Recommended answer

We can read it using read.csv, convert the first column to a date-time binned into 6-minute intervals, and add a dummy column of 1's. Then re-read it using read.zoo, splitting on the type and aggregating on the dummy column:

# test data

Lines <- 'date,type
"Sep 22, 2011 12:54:53.081240000","2"
"Sep 22, 2011 12:54:53.083493000","2"
"Sep 22, 2011 12:54:53.084025000","2"
"Sep 22, 2011 12:54:53.086493000","2"
"Sep 22, 2011 12:54:53.081240000","3"
"Sep 22, 2011 12:54:53.083493000","3"
"Sep 22, 2011 12:54:53.084025000","3"
"Sep 22, 2011 12:54:53.086493000","4"'

library(zoo)
library(chron)

# convert to chron and bin into 6 minute bins using trunc
# Also add a dummy column of 1's 
# and remove any leading space (removing space not needed if there is none)

DF <- read.csv(textConnection(Lines), as.is = TRUE)
fmt <- '%b %d, %Y %H:%M:%S'
DF <- transform(DF, dummy = 1,
         date = trunc(as.chron(sub("^ *", "", date), format = fmt), "00:06:00"))

# split and aggregate

z <- read.zoo(DF, split = 2, aggregate = length)

With the above test data the solution looks like this:

> z
                    2 3 4
(09/22/11 12:54:00) 4 3 1

Note that the above has been done in wide form, since that form constitutes a time series whereas the long form does not. There is one column for each type. In our test data we had types 2, 3 and 4, so there are three columns.
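If the long layout shown in the question is wanted instead, one possible base R sketch (not part of the original answer) is to cross-tabulate bin and type with table() and flatten the result with as.data.frame(). The data frame `binned` and its values are hypothetical and stand in for data whose timestamps have already been truncated to 6-minute bins.

```r
# Hypothetical, already-binned data: one row per event
binned <- data.frame(
  date = c("2011-09-22 12:54:00", "2011-09-22 12:54:00",
           "2011-09-22 12:54:00", "2011-09-22 13:00:00"),
  type = c(2L, 2L, 3L, 2L),
  stringsAsFactors = FALSE
)

# Cross-tabulate bin x type, then flatten to one row per combination
long <- as.data.frame(table(date = binned$date, type = binned$type),
                      responseName = "count",
                      stringsAsFactors = FALSE)

# Drop bin/type combinations that never occurred and sort for readability
long <- long[long$count > 0, ]
long <- long[order(long$date, long$type), ]
print(long)
```

Note that `type` comes back as character here because it passes through the table's dimension names; convert it with as.integer() if numeric types are needed.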

(We have used chron here since its trunc method fits well with binning into 6-minute groups. chron does not support time zones, which can be an advantage since it rules out the many possible time zone errors. If you want POSIXct anyway, convert at the end, e.g. time(z) <- as.POSIXct(paste(as.Date.dates(time(z)), times(time(z)) %% 1)). This expression is shown in a table in one of the R News 4/1 articles, except that we used as.Date.dates instead of just as.Date to work around a bug that seems to have been introduced since then. We could also use time(z) <- as.POSIXct(time(z)), but that would result in a different time zone.)
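For readers who would rather stay in POSIXct throughout, the modulo idiom from the zoo snippet quoted in the question can do the binning directly. The following is only a sketch of that alternative, not the answer's method; the vector `ts` and the choice of UTC are assumptions made for illustration.

```r
# Illustrative timestamps (assumed to be UTC)
ts <- as.POSIXct(c("2011-09-22 12:54:53", "2011-09-22 12:59:59",
                   "2011-09-22 13:00:00", "2011-09-22 13:05:10"),
                 tz = "UTC")

# Floor each timestamp to the start of its 6-minute (360 s) bin.
# Bins are aligned to the Unix epoch, i.e. to minutes 0, 6, 12, ... in UTC.
bins <- ts - as.numeric(ts) %% 360

# Count events per bin, labelling each bin by its start time
counts <- table(format(bins, "%Y-%m-%d %H:%M:%S"))
print(counts)
```

Because the bins are anchored to the epoch, they line up with local clock minutes only in time zones whose UTC offset is a multiple of 6 minutes; that holds for whole-hour and half-hour offsets such as CEST.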

The original solution binned into dates, but I noticed afterwards that you wish to bin into 6-minute periods, so the solution was revised.

Revised according to comments.
