Creating a new R data.table column based on values in another column and grouping

Posted: 2017/3/12 11:00:23. Tags: r, data.table

Problem description

I have a data.table with date, zipcode and purchase amounts.
library(data.table)
set.seed(88)
DT <- data.table(date = Sys.Date() - 365 + sort(sample(1:100, 10)),
                 zip = sample(c("2000", "1150", "3000"), 10, replace = TRUE),
                 purchaseAmount = sample(1:20, 10))
This creates the following:
    date       zip              purchaseAmount
 1: 2016-01-08 1150              5
 2: 2016-01-15 3000             15
 3: 2016-02-15 1150             16
 4: 2016-02-20 2000             18
 5: 2016-03-07 2000             19
 6: 2016-03-15 2000             11
 7: 2016-03-17 2000              6
 8: 2016-04-02 1150             17
 9: 2016-04-08 3000              7
10: 2016-04-09 3000             20
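I would like to add a fourth column, earlierPurchases (shown as new_col in the output further down). For each row, this column should sum all the values in purchaseAmount from the previous x days, up to and including the row's own date, within the same zipcode.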

EDIT: As per suggestion from Frank, here is the expected output:
          date  zip purchaseAmount new_col
 1: 2016-01-08 1150              5       5
 2: 2016-01-15 3000             15      15
 3: 2016-02-15 1150             16      16
 4: 2016-02-20 2000             18      18
 5: 2016-03-07 2000             19      19
 6: 2016-03-15 2000             11      30
 7: 2016-03-17 2000              6      36
 8: 2016-04-02 1150             17      17
 9: 2016-04-08 3000              7       7
10: 2016-04-09 3000             20      27
Is there a data.table way to do this, or should I just write a looping function?
Solution
This seems to work:
DT[, new_col := 
  DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1), 
    sum(purchaseAmount)
  , by=.EACHI ]$V1
]


          date  zip purchaseAmount new_col
 1: 2016-01-08 1150              5       5
 2: 2016-01-15 3000             15      15
 3: 2016-02-15 1150             16      16
 4: 2016-02-20 2000             18      18
 5: 2016-03-07 2000             19      19
 6: 2016-03-15 2000             11      30
 7: 2016-03-17 2000              6      36
 8: 2016-04-02 1150             17      17
 9: 2016-04-08 3000              7       7
10: 2016-04-09 3000             20      27
This uses a "non-equi" join, effectively taking each row; finding all rows that meet our criteria in the on= expression for each row; and then summing by row (by=.EACHI). In this case, a non-equi join is probably less efficient than some rolling-sum approach.
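
One way to see what the join computes is to recompute the same column with a plain row-by-row scan. The sketch below is only an illustration and not part of the original answer: it rebuilds the table from the printed values (so the dates are fixed instead of depending on Sys.Date()), stores the per-row lookup windows in a helper table called windows (a hypothetical name), and compares the join result with a brute-force sapply.

```r
library(data.table)

# Rebuild the example table from the printed output so the dates are fixed
DT <- data.table(
  date = as.Date(c("2016-01-08", "2016-01-15", "2016-02-15", "2016-02-20",
                   "2016-03-07", "2016-03-15", "2016-03-17", "2016-04-02",
                   "2016-04-08", "2016-04-09")),
  zip = c("1150", "3000", "1150", "2000", "2000",
          "2000", "2000", "1150", "3000", "3000"),
  purchaseAmount = c(5L, 15L, 16L, 18L, 19L, 11L, 6L, 17L, 7L, 20L)
)

# One lookup window per row: same zip, dates in [date - 10, date]
windows <- DT[, .(zip, d0 = date - 10, d1 = date)]

# Non-equi join: sum purchaseAmount over the rows matching each window
join_sum <- DT[windows, on = .(zip, date >= d0, date <= d1),
               sum(purchaseAmount), by = .EACHI]$V1

# Brute-force equivalent: scan the whole table once per row
loop_sum <- sapply(seq_len(nrow(DT)), function(k) {
  in_window <- DT$zip == DT$zip[k] &
    DT$date >= DT$date[k] - 10 &
    DT$date <= DT$date[k]
  sum(DT$purchaseAmount[in_window])
})

# Both approaches should agree row for row
stopifnot(all(join_sum == loop_sum))
```

Both vectors should match the new_col column shown above. The loop scans every row of the table for each row, which is the work the join expresses in a single call.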



How it works.

To add columns to a data.table, the usual syntax is DT[, new_col := expression]. Here, the expression actually works even outside of the DT[...]. Try running it on its own:
DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1), 
  sum(purchaseAmount)
, by=.EACHI ]$V1
You can progressively simplify this until it's just the join...
DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1), 
  sum(purchaseAmount)
, by=.EACHI ]
# note that V1 is the default name for computed columns

DT[.(zip = zip, d0 = date - 10, d1 = date), on=.(zip, date >= d0, date <= d1)]
# now we're down to just the join
The join syntax is like x[i, on=.(xcol = icol, xcol2 < icol2)], as documented on the help page that opens when you type ?data.table in an R console with the data.table package loaded.
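
For a smaller view of that syntax, here is a toy sketch with made-up tables (x, lookup, and their columns are hypothetical and not from the question). It combines one equality condition and one inequality condition in on=, then counts the matching rows of x for each row of lookup.

```r
library(data.table)

# x: the table being searched; lookup: one query row per id
x <- data.table(id = c("a", "a", "a", "b", "b"), val = c(1, 2, 5, 3, 7))
lookup <- data.table(id = c("a", "b"), cutoff = c(4, 5))

# For each row of lookup, count the rows of x with the same id and val < cutoff
counts <- x[lookup, on = .(id, val < cutoff), .N, by = .EACHI]
print(counts$N)
```

In on=, the left-hand name of each condition refers to a column of x and the right-hand name to a column of lookup; a bare name such as id means equality on a column that both tables share.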

To get started with data.table, I'd suggest reviewing the vignettes. After that, this'll probably look a lot more legible.
