Expand Categorical Column in a Time Series to Multiple Per-Second Count Columns
Question
What is the best way to make the following transformation? There are two parts to it. The first is to convert the speed to a per-second mean. The second is to turn the categorical column into multiple columns, one per categorical value, where each value is the count of occurrences per second. For example:
Input (xts A):
Time (POSIXct), Observed Letter, Speed
2011/01/11 12:12:01.100,A,1
2011/01/11 12:12:01.200,A,2
2011/01/11 12:12:01.400,B,3
2011/01/11 12:12:01.800,C,4
2011/01/11 12:12:02.200,D,2
2011/01/11 12:12:02.200,A,7
Output (xts B):
Time, A_Per_Second, B_Per_Second, C_Per_Second, D_Per_Second, Aggregate_Speed
2011/01/11 12:12:01,2,1,1,0,2.5
2011/01/11 12:12:02,1,0,0,1,4.5
I would like to do this without needing to know in advance what all the categories are. Basically, I am trying to collapse the time index to one row per second without losing any of my categorical data, while summarizing the numeric data as a per-second mean.
Recommended answer
Here's the structure I'm using for A. Note that the "numbers" are really character strings, since you can't mix types in a matrix.
library(xts)  # provides xts(), index(), align.time(), endpoints(), period.apply()

A <- structure(c("A", "A", "B", "C", "D", "A", "1", "2", "3", "4",
"2", "7"), .Dim = c(6L, 2L), .Dimnames = list(NULL, c("Observed_Letter",
"Speed")), index = structure(c(1294769521.1, 1294769521.2, 1294769521.4,
1294769521.8, 1294769522.2, 1294769522.2), tzone = "", tclass = c("POSIXct",
"POSIXt")), .indexCLASS = c("POSIXct", "POSIXt"), .indexTZ = "",
class = c("xts", "zoo"))
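The coercion mentioned above is easy to check in base R. This is a minimal sketch; the names m and speed are made up for illustration and are not part of the answer.

```r
# Binding a character column and a numeric column into one matrix
# coerces everything to character, because a matrix holds a single type.
m <- cbind(Observed_Letter = c("A", "B"), Speed = c(1, 2))
typeof(m)   # "character"

# Convert the Speed column back to numeric before doing arithmetic on it.
speed <- as.numeric(m[, "Speed"])
mean(speed) # 1.5
```

This is why the clean() function in the answer calls as.numeric() on the Speed column.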
This function will clean up each of the categories.
clean <- function(x) {
  # construct xts object with only Speed and convert it to numeric
  out <- xts(as.numeric(x$Speed), index(x))
  # add column names
  colnames(out) <- paste(x$Observed_Letter[1], "_Per_Second", sep="")
  out # return object
}
Here's the guts of what you need. Note the need to call split.default explicitly, since there's a split method for xts objects that splits by time. You don't strictly need align.time either, but it rounds each period up to the whole second. Without it, the index for each second will be the last actual timestamp observed in that second.
# split by Observed_Letter, apply clean() to each list element, and merge results
combA <- do.call(merge, lapply(split.default(A, A$Observed_Letter), clean))
alignA <- align.time(combA,1)
# get the last obs for each 1-second period (for period.apply)
EPalignA <- endpoints(combA, "seconds")
# count the number of non-NA observations by column for each 1-second period
counts <- period.apply(alignA, EPalignA, function(x) colSums(!is.na(x)))
# sum the non-NA observations for each column and 1-second period
values <- period.apply(alignA, EPalignA, colSums, na.rm=TRUE)
# calculate aggregate speed
B <- counts
B$Aggregate_Speed <- rowSums(values)/rowSums(counts)
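As a cross-check on the expected output, the same per-second summary can be sketched in base R on a plain data frame. This is not the answer's method: the data frame df, its column names, and the UTC time zone are assumptions made for illustration, and rows are labeled by the truncated second rather than rounded up as align.time does.

```r
# Sample data from the question, held in a plain data frame
# (an assumption; the answer above works on an xts object).
df <- data.frame(
  time = as.POSIXct(c("2011-01-11 12:12:01.100", "2011-01-11 12:12:01.200",
                      "2011-01-11 12:12:01.400", "2011-01-11 12:12:01.800",
                      "2011-01-11 12:12:02.200", "2011-01-11 12:12:02.200"),
                    tz = "UTC"),
  letter = c("A", "A", "B", "C", "D", "A"),
  speed  = c(1, 2, 3, 4, 2, 7)
)

# Label each observation with its whole second (fractional part dropped).
sec <- format(df$time, "%Y-%m-%d %H:%M:%S")

# table() discovers the categories itself, so they need not be known up front.
counts <- table(sec, df$letter)

# Per-second mean of the numeric column.
avg <- tapply(df$speed, sec, mean)

# Widen the contingency table into one count column per category.
B <- as.data.frame.matrix(counts)
colnames(B) <- paste0(colnames(B), "_Per_Second")
B$Aggregate_Speed <- as.numeric(avg[rownames(B)])
B
```

For the sample data this gives counts of 2, 1, 1, 0 with a mean speed of 2.5 for 12:12:01, and 1, 0, 0, 1 with a mean speed of 4.5 for 12:12:02, matching the output requested in the question.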