How to get a quick summary in data.table with a look-back window?
Question
This question builds on "How to quickly get a summary of counts in data.table".
Similarly, this is part of feature engineering: for each ID, summarize the counts by the column Col while looking back over a fixed time window. The same preprocessing will be applied to the test set. Because the data set is large, a data.table-based solution is preferred.
Training input:
ID Time Col Count
A 2017-06-05 M 1
A 2017-06-02 M 1
A 2017-06-03 M 1
B 2017-06-02 K 1
B 2017-06-01 M 4
Looking back two days from each row, we have:
ID Time Time-2D Col Count
A 2017-06-05 2017-06-03 M 1 #Time-2D by moving time two days back
A 2017-06-02 2017-05-31 M 1
A 2017-06-03 2017-06-01 M 1
B 2017-06-02 2017-05-31 K 1
B 2017-06-01 2017-05-30 M 4
Expected output (count):
ID Time Time-2D Col_M Col_K
A 2017-06-05 2017-06-03 1 0 #from 2017-06-03 to 2017-06-05, for user A, there are 0 (sum(count)) of K and 1 (sum(count)) of M.
A 2017-06-02 2017-05-31 1 0
A 2017-06-03 2017-06-01 2 0 # 2 because from 06-01 to 06-03 the first table has two rows (A 2017-06-02 M 1; A 2017-06-03 M 1), so the summed count for M is 2.
B 2017-06-02 2017-05-31 0 1
B 2017-06-01 2017-05-30 4 0
2. Calculate ratio
Based on the table above, the expected output (ratio) is:
ID Time Time-2D Col_M Col_K
A 2017-06-05 2017-06-03 1 0 # 1/sum(1+0)
A 2017-06-02 2017-05-31 1 0
A 2017-06-03 2017-06-01 1 0 #2/sum(2+0)
B 2017-06-02 2017-05-31 0 1
B 2017-06-01 2017-05-30 1 0 # 4/sum(4+0)
The above describes how the training data is processed. The test set must be mapped onto the same columns (Col_M, Col_K), so any other value that appears in Col, such as S, is ignored.
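One way to read the test-set requirement is sketched below: keep only the categories seen in training, reshape, and pad any training category that is absent from the test data with zeros. This is a minimal sketch under assumptions, not part of the original question. The test rows and the names train_levels, test, wide_test and missing_cols are made up for illustration, and the look-back join is left out to keep the mapping step visible.

```r
library(data.table)

# Categories observed in the training Col (assumed to be known up front)
train_levels <- c("K", "M")

# Hypothetical test rows; "S" never appeared in training
test <- fread("ID Time Col Count
C 2017-06-05 M 2
C 2017-06-04 S 3
C 2017-06-03 M 1")

# Ignore categories that were not seen in training
test <- test[Col %in% train_levels]

# Long to wide; the look-back join would normally run before this step
wide_test <- dcast(test, ID + Time ~ Col, value.var = "Count",
                   fun.aggregate = sum, fill = 0)

# Add training categories missing from the test data as all-zero columns
missing_cols <- setdiff(train_levels, names(wide_test))
if (length(missing_cols) > 0) wide_test[, (missing_cols) := 0L]

# Same column order as the training output
setcolorder(wide_test, c("ID", "Time", train_levels))
```

Filtering before the reshape keeps S from ever becoming a column, and the padding step guarantees that the training and test tables share the same layout.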
Recommended answer
I think I understand your request. You seem to care about the order of the observations, even when, for instance, the second observation's Time is earlier than the first observation's Time. That doesn't make much sense to me, but here is a quite efficient data.table solution that achieves it. It does a non-equi join on ID, Col, both Time columns and the row index (which is basically the order of appearance). Afterwards, it uses dcast to convert from long to wide (as in your previous question). Note that the result is ordered by date, but I've kept the rowindx variable, so you can restore the original order with setorder. I'll leave the ratio calculation to you, as it is very basic (hint: don't use loops; it is a fully vectorized one-liner).
library(data.table) #v1.10.4+
## Read the data
DT <- fread("ID Time Col Count
A 2017-06-05 M 1
A 2017-06-02 M 1
A 2017-06-03 M 1
B 2017-06-02 K 1
B 2017-06-01 M 4")
## Prepare the variables we need for the join
DT[, Time := as.IDate(Time)]
DT[, Time_2D := Time - 2L]
DT[, rowindx := .I]
## Non-equi join, sum `Count` by each join
DT2 <- DT[DT,
          sum(Count),
          on = .(ID, Col, rowindx <= rowindx, Time <= Time, Time >= Time_2D),
          by = .EACHI]
## Fix column names (a known issue)
setnames(DT2, make.unique(names(DT2)))
## Long to wide (You can reorder back using `rowindx` and `setorder` function)
dcast(DT2, ID + Time + Time.1 + rowindx ~ Col, value.var = "V1", fill = 0)
# ID Time Time.1 rowindx K M
# 1: A 2017-06-02 2017-05-31 2 0 1
# 2: A 2017-06-03 2017-06-01 3 0 2
# 3: A 2017-06-05 2017-06-03 1 0 1
# 4: B 2017-06-01 2017-05-30 5 0 4
# 5: B 2017-06-02 2017-05-31 4 1 0
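The ratio step that the answer leaves as an exercise could look roughly like the sketch below. It is one possible reading of the hint, not the answerer's own code. The table wide is typed in by hand from the printed dcast output, and the names wide, cols, total and ratio are assumptions. It also assumes every row has a non-zero total, which holds here because each row counts at least itself.

```r
library(data.table)

# Count table copied by hand from the dcast output printed above
wide <- data.table(
  ID      = c("A", "A", "A", "B", "B"),
  Time    = as.IDate(c("2017-06-02", "2017-06-03", "2017-06-05",
                       "2017-06-01", "2017-06-02")),
  rowindx = c(2L, 3L, 1L, 5L, 4L),
  K       = c(0L, 0L, 0L, 0L, 1L),
  M       = c(1L, 2L, 1L, 4L, 0L)
)

cols <- c("K", "M")   # category columns created by dcast

# Row totals across the category columns
total <- wide[, rowSums(.SD), .SDcols = cols]

# Vectorized ratio: divide each category column by the row total
ratio <- copy(wide)
ratio[, (cols) := lapply(.SD, `/`, total), .SDcols = cols]

# Restore the original order of appearance
setorder(ratio, rowindx)
```

After the setorder call the rows follow the order of the training input, so the M and K columns can be compared directly with the expected ratio table in the question.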