Creating a new R data.table column based on values in another column and grouping
Problem description
I have a data.table with date, zipcode and purchase amounts.

    library(data.table)
    set.seed(88)
    DT <- data.table(date = Sys.Date() - 365 + sort(sample(1:100, 10)),
                     zip = sample(c("2000", "1150", "3000"), 10, replace = TRUE),
                     purchaseAmount = sample(1:20, 10))
This creates the following:
              date  zip purchaseAmount
     1: 2016-01-08 1150              5
     2: 2016-01-15 3000             15
     3: 2016-02-15 1150             16
     4: 2016-02-20 2000             18
     5: 2016-03-07 2000             19
     6: 2016-03-15 2000             11
     7: 2016-03-17 2000              6
     8: 2016-04-02 1150             17
     9: 2016-04-08 3000              7
    10: 2016-04-09 3000             20
I would like to add a fourth column, earlierPurchases. This column should sum all the values in purchaseAmount for the previous x days of date (x = 10 in the expected output below), including the current row, within the same zip code.

EDIT: As per the suggestion from Frank, here is the expected output:

              date  zip purchaseAmount new_col
     1: 2016-01-08 1150              5       5
     2: 2016-01-15 3000             15      15
     3: 2016-02-15 1150             16      16
     4: 2016-02-20 2000             18      18
     5: 2016-03-07 2000             19      19
     6: 2016-03-15 2000             11      30
     7: 2016-03-17 2000              6      36
     8: 2016-04-02 1150             17      17
     9: 2016-04-08 3000              7       7
    10: 2016-04-09 3000             20      27
Is there a data.table way to do this, or should I just write a looping function?

Solution

This seems to work:
    DT[, new_col :=
      DT[.(zip = zip, d0 = date - 10, d1 = date), on = .(zip, date >= d0, date <= d1),
        sum(purchaseAmount)
      , by = .EACHI]$V1
    ]

              date  zip purchaseAmount new_col
     1: 2016-01-08 1150              5       5
     2: 2016-01-15 3000             15      15
     3: 2016-02-15 1150             16      16
     4: 2016-02-20 2000             18      18
     5: 2016-03-07 2000             19      19
     6: 2016-03-15 2000             11      30
     7: 2016-03-17 2000              6      36
     8: 2016-04-02 1150             17      17
     9: 2016-04-08 3000              7       7
    10: 2016-04-09 3000             20      27
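
The table in the question depends on Sys.Date() and on the random number generator, so running the original setup today will produce different rows. To check the join against the expected output, the ten rows can be typed in directly. This is a minimal sketch: window_days and loop_col are names introduced here for illustration, and the sapply loop is only a cross-check, not part of the answer.

```r
library(data.table)

# Fixed copy of the example rows from the question.
DT <- data.table(
  date = as.Date(c("2016-01-08", "2016-01-15", "2016-02-15", "2016-02-20",
                   "2016-03-07", "2016-03-15", "2016-03-17", "2016-04-02",
                   "2016-04-08", "2016-04-09")),
  zip = c("1150", "3000", "1150", "2000", "2000", "2000", "2000",
          "1150", "3000", "3000"),
  purchaseAmount = c(5, 15, 16, 18, 19, 11, 6, 17, 7, 20)
)

window_days <- 10  # hypothetical name for the "previous x days" window

# The non-equi join from the answer, with the window length as a variable.
DT[, new_col :=
  DT[.(zip = zip, d0 = date - window_days, d1 = date),
     on = .(zip, date >= d0, date <= d1),
     sum(purchaseAmount),
     by = .EACHI]$V1
]

# Row-by-row loop computing the same quantity, used only as a cross-check.
loop_col <- sapply(seq_len(nrow(DT)), function(k) {
  same_zip  <- DT$zip == DT$zip[k]
  in_window <- DT$date >= DT$date[k] - window_days & DT$date <= DT$date[k]
  sum(DT$purchaseAmount[same_zip & in_window])
})

print(DT)
print(all(DT$new_col == loop_col))
```

If both computations agree, the final print shows TRUE and new_col matches the expected output from the question.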
This uses a "non-equi" join, effectively taking each row; finding all rows that meet our criteria in the on= expression for each row; and then summing by row (by=.EACHI). In this case, a non-equi join is probably less efficient than some rolling-sum approach.
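
The answer does not say which rolling-sum approach it has in mind. One possible alternative, sketched here as an assumption rather than as the answer's method, uses a cumulative sum per zip together with base R's findInterval. It assumes whole-day dates sorted in ascending order within each zip. The names windowed_sum, window_days and alt_col are made up for this sketch.

```r
library(data.table)

# Fixed copy of the example rows from the question.
DT <- data.table(
  date = as.Date(c("2016-01-08", "2016-01-15", "2016-02-15", "2016-02-20",
                   "2016-03-07", "2016-03-15", "2016-03-17", "2016-04-02",
                   "2016-04-08", "2016-04-09")),
  zip = c("1150", "3000", "1150", "2000", "2000", "2000", "2000",
          "1150", "3000", "3000"),
  purchaseAmount = c(5, 15, 16, 18, 19, 11, 6, 17, 7, 20)
)

window_days <- 10  # hypothetical name for the window length

# Hypothetical helper: windowed sum over [d - width, d] from a cumulative sum.
# Assumes d holds whole-day dates sorted in ascending order.
windowed_sum <- function(d, amount, width) {
  dn <- as.numeric(d)                     # dates as day counts
  cs <- c(0, cumsum(amount))              # cs[k + 1] = sum of the first k amounts
  hi <- findInterval(dn, dn)              # number of rows with date <= current date
  lo <- findInterval(dn - width - 1, dn)  # number of rows with date < current date - width
  cs[hi + 1] - cs[lo + 1]
}

setorder(DT, date)  # guarantees ascending dates within each zip
DT[, alt_col := windowed_sum(date, purchaseAmount, window_days), by = zip]

print(DT)
```

Whether this is actually faster than the non-equi join depends on the data size and was not measured here.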
How it works.

To add columns to a data.table, the usual syntax is DT[, new_col := expression]. Here, the expression actually works even outside of the DT[...]. Try running it on its own:

    DT[.(zip = zip, d0 = date - 10, d1 = date), on = .(zip, date >= d0, date <= d1),
      sum(purchaseAmount)
    , by = .EACHI]$V1
You can progressively simplify this until it's just the join...

    DT[.(zip = zip, d0 = date - 10, d1 = date), on = .(zip, date >= d0, date <= d1),
      sum(purchaseAmount)
    , by = .EACHI]
    # note that V1 is the default name for computed columns

    DT[.(zip = zip, d0 = date - 10, d1 = date), on = .(zip, date >= d0, date <= d1)]
    # now we're down to just the join
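
To see how many rows each window picks up, the sum in j can be paired with a row count via .N. This is a sketch using the example rows typed in directly; the column names n_rows and total are made up for illustration.

```r
library(data.table)

# Fixed copy of the example rows from the question.
DT <- data.table(
  date = as.Date(c("2016-01-08", "2016-01-15", "2016-02-15", "2016-02-20",
                   "2016-03-07", "2016-03-15", "2016-03-17", "2016-04-02",
                   "2016-04-08", "2016-04-09")),
  zip = c("1150", "3000", "1150", "2000", "2000", "2000", "2000",
          "1150", "3000", "3000"),
  purchaseAmount = c(5, 15, 16, 18, 19, 11, 6, 17, 7, 20)
)

# Same join as in the answer, but j returns named columns:
# n_rows counts the rows of DT that fall in each window, total sums them.
res <- DT[.(zip = zip, d0 = date - 10, d1 = date),
          on = .(zip, date >= d0, date <= d1),
          .(n_rows = .N, total = sum(purchaseAmount)),
          by = .EACHI]

print(res)
```

Rows 6, 7 and 10 are the only ones whose window contains more than one purchase, which is why only those rows differ from purchaseAmount in the expected output.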
The join syntax is like x[i, on=.(xcol = icol, xcol2 < icol2)], as documented in the doc page that opens when you type ?data.table into an R console with the data.table package loaded.

To get started with data.table, I'd suggest reviewing the vignettes. After that, this'll probably look a lot more legible.
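
As a small illustration of that join syntax, here is a toy example with two made-up tables. The table and column names (x, i, id, grp, value, limit, n_match) are hypothetical, and the equality condition is written with ==, which data.table also accepts.

```r
library(data.table)

# x is the table being searched; i supplies one condition set per row.
x <- data.table(id = c("a", "a", "b"), value = c(1, 5, 3))
i <- data.table(grp = c("a", "b"), limit = c(6, 10))

# For each row of i, match rows of x where x$id equals i$grp
# and x$value is strictly below i$limit, then count the matches.
res <- x[i, on = .(id == grp, value < limit), .(n_match = .N), by = .EACHI]

print(res)
```

In each on= condition the left-hand side names a column of x and the right-hand side a column of i, which is the same layout as date >= d0 in the answer.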