Create cumulative counter variable per-user, with multiple conditions


Question


I need to create a counter variable that depends on three other variables.

This is an extension of an earlier question. Consider multiple customers placing orders on Amazon. I want to count the successful orders for each user. When a user places an order successfully, the counter increases by one; when an order fails, the counter stays the same. The counter therefore depends on the time, the order status, and the user.

Please also consider the case where the time is the same but the order status is different. Rows that share a time are not duplicates; they differ in other columns.

library(data.table)
DT <- data.table(time=c(1,2,2,2,1,1,2,3,1,1),user=c(1,1,1,1,2,3,3,3,4,4), order_status=c('f','f','t','t','f','f','t','t','t','t'))
DT

The desired counter output is as follows. The 'output' column is the counter variable.

    time user order_status output
 1:    1    1            f      0
 2:    2    1            f      0
 3:    2    1            t      1
 4:    2    1            t      1
 5:    1    2            f      0
 6:    1    3            f      0
 7:    2    3            t      1
 8:    3    3            t      2
 9:    1    4            t      1
10:    1    4            t      1

Solution

The main challenge here is to set the first occurrence of every combination of time, user, order_status == "t" to 1 and leave every other row at 0. After that, it's a simple cumulative sum grouped by user.
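As a minimal sketch of that idea (not the answer's own code), the first occurrence can also be flagged with duplicated() and then summed per user. The column names first_hit and output and the vector key_cols are assumptions introduced here for illustration only.

```r
library(data.table)

DT <- data.table(
  time         = c(1, 2, 2, 2, 1, 1, 2, 3, 1, 1),
  user         = c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4),
  order_status = c("f", "f", "t", "t", "f", "f", "t", "t", "t", "t")
)

# Hypothetical helper: the columns that identify one order event
key_cols <- c("time", "user", "order_status")

# TRUE only for the first row of each (time, user) combination
# whose order_status is "t"; FALSE everywhere else
DT[, first_hit := order_status == "t" & !duplicated(DT, by = key_cols)]

# Running count of the flagged rows within each user
DT[, output := cumsum(first_hit), by = user]
DT
```

With the sample data from the question, the output column should match the desired counter shown there.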

Here are two ways to accomplish this using data.table:

Method 1:

DT[, id := 0L
  ][order_status == "t", id := c(1L, rep(0L, .N-1L)), by=names(DT)
   ][, id := cumsum(id), by=user]

The 2nd line here marks the first occurrence by 1 only when order_status == "t".

A heavily commented production version of mine would look something like this:

DT[, id := 0L                       # set entire id col to 0
  ][order_status == "t",            # then, where order status is true
      id := c(1L, rep(0L, .N-1L)),  # set (or update) first value to 1
      by = names(DT)                # for every time,user,order_status
   ][, id := cumsum(id),            # then, get cumulative sum of id
       by = user]                   # for every user
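To check Method 1 against the desired output, a self-contained run might look like the sketch below. It assumes the sample data from the question. The grouping columns are stored in key_cols (a name introduced here, not part of the original answer) before id is added, so the grouping is explicit.

```r
library(data.table)

DT <- data.table(
  time         = c(1, 2, 2, 2, 1, 1, 2, 3, 1, 1),
  user         = c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4),
  order_status = c("f", "f", "t", "t", "f", "f", "t", "t", "t", "t")
)

# Assumption: capture the grouping columns before the id column exists
key_cols <- names(DT)   # "time", "user", "order_status"

DT[, id := 0L]                                   # start every row at 0
DT[order_status == "t",                          # successful orders only
   id := c(1L, rep(0L, .N - 1L)),                # 1 for the first row, 0 after
   by = key_cols]                                # within each time/user/status
DT[, id := cumsum(id), by = user]                # running total per user

# Desired counter taken from the question
expected <- c(0L, 0L, 1L, 1L, 0L, 0L, 1L, 2L, 1L, 1L)
stopifnot(identical(DT$id, expected))
```

The stopifnot() call is silent when the result matches and raises an error otherwise, which makes it a convenient guard when adapting the code to other data.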


Method 2: Using data.table's join+update:

DT[, id := 0L
  ][DT, id := as.integer(order_status == "t"), mult="first", on=names(DT)
   ][, id := cumsum(id), by=user]

The 2nd step here does the same as in Method 1, but it identifies the first occurrence directly and updates it to 1 if order_status == "t", by performing an update on a join-based subset. You can replace the inner DT with unique(DT) to remove redundancy.

If I had to choose, I'd say the 1st method is more efficient, since creating a rep() for each group should be quite fast compared with a join+update. But I find that the 2nd method makes the actual operation easier to recognize, which I think matters more when you look at your code again several weeks later.
