Create cumulative counter variable per-user, with multiple conditions

Views: 122 Posted: 2017/3/12 12:03:08 Tags: r data.table dplyr cumulative-sum


Problem description

I need to create a counter variable that depends on three other variables.

This is an extension of an earlier question.
Consider multiple customers placing orders on Amazon. I want to count the successful orders of each user. When an order is placed successfully, the counter increases by one; when an order fails, the counter stays the same. The counter therefore depends on the time, the order status and the user.

Please also consider the case where the time is the same but the order status differs. That does not mean the rows are duplicates; they differ in other columns.
library(data.table)
DT <- data.table(time = c(1,2,2,2,1,1,2,3,1,1),
                 user = c(1,1,1,1,2,3,3,3,4,4),
                 order_status = c('f','f','t','t','f','f','t','t','t','t'))
DT
The desired counter output is as follows. The 'output' column is the counter variable.
    time user order_status output
 1:    1    1            f      0
 2:    2    1            f      0
 3:    2    1            t      1
 4:    2    1            t      1
 5:    1    2            f      0
 6:    1    3            f      0
 7:    2    3            t      1
 8:    3    3            t      2
 9:    1    4            t      1
10:    1    4            t      1

Solution
The main challenge here is to set the first occurrence of every combination of time and user with order_status == 't' to 1. After that, it is a simple cumulative sum grouped by user.
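As a rough illustration of these two steps, here is a minimal sketch in base R rather than data.table, so it is not one of the methods described in this answer. The data frame `df`, the vector `key_cols` and the helper column `first_success` are names chosen for this example only.

```r
# Same sample data as in the question, as a plain data.frame
df <- data.frame(
  time         = c(1, 2, 2, 2, 1, 1, 2, 3, 1, 1),
  user         = c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4),
  order_status = c("f", "f", "t", "t", "f", "f", "t", "t", "t", "t"),
  stringsAsFactors = FALSE
)

# Step 1: flag the first row of each (time, user, order_status) combination,
# but only where the order succeeded
key_cols <- c("time", "user", "order_status")
df$first_success <- df$order_status == "t" & !duplicated(df[key_cols])

# Step 2: cumulative sum of the flags within each user
df$id <- ave(as.integer(df$first_success), df$user, FUN = cumsum)

df$id
```

On the sample data, `df$id` should match the `output` column shown in the question.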

Here are two ways to accomplish this using data.table:

Method 1:
DT[, id := 0L
  ][order_status == "t", id := c(1L, rep(0L, .N-1L)), by=.(time, user, order_status)
   ][, id := cumsum(id), by=user]
The 2nd line here marks the first occurrence with 1, and only where order_status == "t".

A heavily commented production version of mine would look something like this:
DT[, id := 0L                            # set entire id col to 0
  ][order_status == "t",                 # then, where order status is true
      id := c(1L, rep(0L, .N-1L)),       # set (or update) first value to 1
      by = .(time, user, order_status)   # for every time,user,order_status
   ][, id := cumsum(id),                 # then, get cumulative sum of id
       by = user]                        # for every user
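To check Method 1 against the desired output, the following sketch rebuilds the sample data and compares the result with the `output` column from the question. It assumes the data.table package is installed. The vector `expected` is a name introduced here for illustration, and the grouping columns are listed explicitly.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  time         = c(1, 2, 2, 2, 1, 1, 2, 3, 1, 1),
  user         = c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4),
  order_status = c("f", "f", "t", "t", "f", "f", "t", "t", "t", "t")
)

# Method 1, written as three separate steps
DT[, id := 0L]                                   # start with 0 everywhere
DT[order_status == "t",                          # successful orders only
   id := c(1L, rep(0L, .N - 1L)),                # 1 for the first row of a group
   by = .(time, user, order_status)]
DT[, id := cumsum(id), by = user]                # running total per user

# 'output' column from the question, typed in by hand for comparison
expected <- c(0L, 0L, 1L, 1L, 0L, 0L, 1L, 2L, 1L, 1L)
identical(DT$id, expected)
```

If the steps behave as described above, the final comparison should return TRUE.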




Method 2: Using data.table's join+update:
DT[, id := 0L
  ][DT, id := as.integer(order_status == "t"), mult="first", on=names(DT)
   ][, id := cumsum(id), by=user]
The 2nd step here does the same as in method 1, but it identifies the first occurrence directly and updates it to 1 if order_status == "t", by performing an update on a join-based subset. You can replace the inner DT with unique(DT) to remove redundancy.

If I had to choose, I'd say the 1st method is more efficient, since creating a rep() for each group should be quite fast compared with a join+update. But I find that the 2nd method makes the actual operation easier to recognize, which I think matters more when you look at your code again several weeks later.
