How to get quick summary in data.table with a look-back window?


Question

This question builds on "How to get quick summary of count in data.table".

Similarly, this is part of a feature-engineering step that summarizes each ID by the column called Col, looking back over a fixed time window. The same preprocessing will be applied to the testing set. Since the data set is large, a data.table-based solution is preferred.

Training input:

ID   Time        Col   Count 
A    2017-06-05   M      1
A    2017-06-02   M      1
A    2017-06-03   M      1
B    2017-06-02   K      1
B    2017-06-01   M      4

Looking back two days gives:

ID   Time          Time-2D   Col   Count
A    2017-06-05   2017-06-03   M      1   #Time-2D by moving time two days back
A    2017-06-02   2017-05-31   M      1
A    2017-06-03   2017-06-01   M      1
B    2017-06-02   2017-05-31   K      1
B    2017-06-01   2017-05-30   M      4

Expected output (count):

ID   Time          Time-2D   Col_M    Col_K
A    2017-06-05   2017-06-03   1      0     #from 2017-06-03 to 2017-06-05, for user A, there are 0 (sum(count)) of K and 1 (sum(count)) of M. 
A    2017-06-02   2017-05-31   1      0
A    2017-06-03   2017-06-01   2      0     # 2 because from 06-01 to 06-03 the first table has two rows for A (2017-06-02 M 1 and 2017-06-03 M 1), so the summed count for M is 2.
B    2017-06-02   2017-05-31   0      1
B    2017-06-01   2017-05-30   4      0



2. Calculate ratio

Based on the above table, the expected output (ratio) is:

ID   Time          Time-2D   Col_M    Col_K
A    2017-06-05   2017-06-03   1      0     # 1/sum(1+0)
A    2017-06-02   2017-05-31   1      0
A    2017-06-03   2017-06-01   1      0     #2/sum(2+0)
B    2017-06-02   2017-05-31   0      1
B    2017-06-01   2017-05-30   1      0     # 4/sum(4+0) 

The above is for processing the training data. For the testing dataset, the result must map onto the same Col_M and Col_K columns, meaning that if another value such as S appears in Col, it is ignored.

Recommended answer

I think I understand your request. You seem to care about the order of the observations regardless of whether, for instance, the second observation's Time is earlier than the first observation's Time. That doesn't make much sense to me, but here is a quite efficient data.table solution that achieves it. It does a non-equi join by ID, Col, both Time columns and the row index (which is basically the order of appearance), and then uses dcast to convert from long to wide (as in your previous question). Note that the result is ordered by date, but I've kept the rowindx variable, so you can restore the original order using setorder. I'll leave the ratio calculation to you, as it is very basic (hint: don't use loops; it is a fully vectorized one-liner).

library(data.table) #v1.10.4+

## Read the data
DT <- fread("ID   Time        Col   Count 
A    2017-06-05   M      1
A    2017-06-02   M      1
A    2017-06-03   M      1
B    2017-06-02   K      1
B    2017-06-01   M      4")

## Prepare the variables we need for the join
DT[, Time := as.IDate(Time)]
DT[, Time_2D := Time - 2L]
DT[, rowindx := .I]

## Non-equi join, sum `Count` by each join
DT2 <- DT[DT, 
          sum(Count), 
          on = .(ID, Col, rowindx <= rowindx, Time <= Time, Time >= Time_2D),
          by = .EACHI]

## Fix column names (a known issue)
setnames(DT2, make.unique(names(DT2)))

## Long to wide (You can reorder back using `rowindx` and `setorder` function)
dcast(DT2, ID + Time + Time.1 + rowindx ~ Col, value.var = "V1", fill = 0)
#    ID       Time     Time.1 rowindx K M
# 1:  A 2017-06-02 2017-05-31       2 0 1
# 2:  A 2017-06-03 2017-06-01       3 0 2
# 3:  A 2017-06-05 2017-06-03       1 0 1
# 4:  B 2017-06-01 2017-05-30       5 0 4
# 5:  B 2017-06-02 2017-05-31       4 1 0
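
The ratio step that the answer leaves as an exercise could look something like the sketch below. It is only one possible way to do it, not the answer author's own code. It assumes the wide table printed above has been stored in a variable (hypothetically named `wide` here and typed in by hand so the snippet runs on its own) and that `K` and `M` are the only category columns.

```r
library(data.table)

# Wide count table as printed above, re-typed by hand (assumption: named `wide`)
wide <- data.table(
  ID      = c("A", "A", "A", "B", "B"),
  Time    = as.IDate(c("2017-06-02", "2017-06-03", "2017-06-05",
                       "2017-06-01", "2017-06-02")),
  Time.1  = as.IDate(c("2017-05-31", "2017-06-01", "2017-06-03",
                       "2017-05-30", "2017-05-31")),
  rowindx = c(2L, 3L, 1L, 5L, 4L),
  K       = c(0L, 0L, 0L, 0L, 1L),
  M       = c(1L, 2L, 1L, 4L, 0L)
)

cols  <- c("K", "M")                           # assumed category columns
total <- rowSums(wide[, cols, with = FALSE])   # per-row sum of all counts

# Vectorized: divide every category column by the row total, no loop over rows
wide[, (cols) := lapply(.SD, function(x) x / total), .SDcols = cols]

# Restore the original row order of the input
setorder(wide, rowindx)
print(wide)
```

A row whose total is 0 would produce NaN under this sketch, so such rows may need separate handling if they can occur in the real data.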

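For the testing set, the question asks that only the categories seen in training (M and K) are kept. The answer does not cover this, so the following is just a hedged sketch of one option: run the same join and dcast on the test data, then align the resulting columns with the training categories. The table `wide_test`, its values, and the names `train_cols` and `id_cols` are hypothetical and exist only for illustration.

```r
library(data.table)

# Hypothetical wide result for a test set where "S" occurs but "K" does not
wide_test <- data.table(
  ID      = c("A", "A"),
  Time    = as.IDate(c("2017-06-06", "2017-06-07")),
  rowindx = 1:2,
  M       = c(0L, 3L),
  S       = c(2L, 0L)
)

train_cols <- c("M", "K")                 # assumed: categories seen in training
id_cols    <- c("ID", "Time", "rowindx")  # assumed: non-category columns

# Add training categories that never occurred in the test set, with zero counts
missing_cols <- setdiff(train_cols, names(wide_test))
if (length(missing_cols)) wide_test[, (missing_cols) := 0L]

# Drop categories that were not seen in training (here: S)
extra_cols <- setdiff(names(wide_test), c(id_cols, train_cols))
if (length(extra_cols)) wide_test[, (extra_cols) := NULL]

# Same column order as the training table
setcolorder(wide_test, c(id_cols, train_cols))
print(wide_test)
```

Because the join in the answer matches on Col, a row whose Col is S contributes nothing to M or K, so dropping the S column afterwards is consistent with ignoring that value.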