相对窗口运行和通过data.table非等值连接 [英] relative windowed running sum through data.table non-equi join
问题描述
我有一个数据集customerId,transactionDate,productId,purchaseQty加载到一个data.table中。对于每一行,我要计算前45天的总收入和平均销售量。
I have a data set customerId, transactionDate, productId, purchaseQty loaded into a data.table. for each row, I want to calculate the sum, and mean of purchaseQty for the prior 45 day
productId customerID transactionDate purchaseQty
1: 870826 1186951 2016-03-28 162000
2: 870826 1244216 2016-03-31 5000
3: 870826 1244216 2016-04-08 6500
4: 870826 1308671 2016-03-28 221367
5: 870826 1308671 2016-03-29 83633
6: 870826 1308671 2016-11-29 60500
我正在寻找这样的输出:
I'm looking for an output like this:
productId customerID transactionDate purchaseQty sumWindowPurchases
1: 870826 1186951 2016-03-28 162000 162000
2: 870826 1244216 2016-03-31 5000 5000
3: 870826 1244216 2016-04-08 6500 11500
4: 870826 1308671 2016-03-28 221367 221367
5: 870826 1308671 2016-03-29 83633 305000
6: 870826 1308671 2016-11-29 60500 60500
因此,sumWindowPurchases包含从当前交易日期起45天窗口内客户/产品的purchaseQty总和。一旦我有这个工作,抛出的平均值和其他计算我需要的应该是微不足道的。
so, sumWindowPurchases contains the sum of purchaseQty for the customer/product over a 45 day window from the current transaction date. Once i have that working, throwing the mean, and other calcs I need should be trivial
我回到我的SQL根,并想到一个自连接:
I went back to my SQL roots and thought of a self join:
select DT.customerId, DT.transactionDate, DT.productId, sum(DT1.purchaseQty)
from DT
inner join DT as DT1 on
DT.customerId = DT1.customerId
and DT.productId = DT1.productId
and DT1.transactionDate between DT.transactionDate and dateadd(day, -45, DT.transactionDate)
尝试使用data.dable语法将其翻译成R,我希望做某事像这样:
Trying to translate that into R using data.dable syntax, I was hoping to do something like this:
DT1 <- DT #alias. have confirmed this is just a pointer
DT[DT1[DT1$transactionDate >= DT$transactionDate - 45],
.(sum(DT1$purchaseQty)),
by = .(DT$customerId , DT$transactionDate ),
on = .(customerId , DT1$transactionDate <= DT$TransactionDate),
allow.cartesian = TRUE]
我想我有一个2部分的问题。什么是R方式做到这一点。是一个data.table自联接正确的方法,或者woudl我更好地尝试使用Reduce函数?
I guess I have a 2 part question. What is the "R way" to do this. Is a data.table self join the correct approach, or woudl i be better of trying to use the Reduce function?
我怀疑自连接是唯一的方法滚动45天窗口在那里。所以第二部分是我需要一些帮助data.table语法来显式引用列来自哪个源表,因为它的自联接,他们有相同的列名。
I suspect the self join is the only way to get the rolling 45 day window in there. so part 2 is I need some help with the data.table syntax to explicitly reference which source table the column comes from, since its a self join and they have the same column names.
我一直在研究弗兰克与这个表达式有关的答案。
Ive been studying the answers that Frank linked to and have come up with this expression
DT[.(p = productId, c = customerID, t = transactionDate, start = transactionDate - 45),
on = .(productId==p, customerID==c, transactionDate<=t, transactionDate>=start),
allow.cartesian = TRUE, nomatch = 0]
生成此输出:
productId customerID transactionDate purchaseQty transactionDate.1
1: 870826 1186951 2016-03-28 162000 2016-02-12
2: 870826 1244216 2016-03-31 5000 2016-02-15
3: 870826 1244216 2016-04-08 5000 2016-02-23
4: 870826 1244216 2016-04-08 6500 2016-02-23
5: 870826 1308671 2016-03-28 221367 2016-02-12
6: 870826 1308671 2016-03-29 221367 2016-02-13
7: 870826 1308671 2016-03-29 83633 2016-02-13
8: 870826 1308671 2016-11-29 60500 2016-10-15
这是非常接近,我需要得到我的最后一步。如果我可以总结这个输出的购买数量,按客户/产品/ transactionDate.1分组,我会有一些有用的。然而,我不能得到的语法下来,不是我理解transactionDate.1名字是从哪里来的
This is very close, to what i need to get to my final step. if i could sum the purchase quantities of this output, group by customer/product/transactionDate.1, i would have something useful. however, I cant get the syntax down for that, not do I understand where the transactionDate.1 name is coming from
推荐答案
也工作,它可以被认为更简单。它具有不需要排序输入集,并且具有较少依赖性的优点。
This also works, it could be considered simpler. It has the advantage of not requiring a sorted input set, and has fewer dependencies.
我仍然不知道为什么它在输出中产生2个transactionDate列。这似乎是on子句的副产品。实际上,输出的列和顺序似乎在on子句的所有元素之后附加和,没有它们的别名。
I still don't know understand why it produces 2 transactionDate columns in the output. This seems to be a byproduct of the "on" clause. In fact, columns and order of the output seems to append the sum after all elements of the on clause, without their alias names
DT[.(p=productId, c=customerID, tmin=transactionDate - 45, tmax=transactionDate),
on = .(productId==p, customerID==c, transactionDate<=tmax, transactionDate>=tmin),
.(windowSum = sum(purchaseQty)), by = .EACHI, nomatch = 0]
这篇关于相对窗口运行和通过data.table非等值连接的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!