Create cumulative counter variable per-user, with multiple conditions
Problem description
相当快,而不是加入+更新。但是,我发现第二种方法更容易识别实际操作是什么,我认为如果你在几个星期后查看你的代码更重要。I need to create a counter variable depending on three other variables.
This is an extension of an earlier question. Consider multiple customers placing orders on Amazon. I want to count the successful orders of each user. When a user places an order successfully, the counter increases by one; when an order fails, the counter stays the same. The counter therefore depends on the time, the order status, and the user.
Please also consider the case where the time is the same but the order status is different. This does not mean the rows are duplicates; they differ in other columns.
library(data.table)
DT <- data.table(time = c(1,2,2,2,1,1,2,3,1,1),
                 user = c(1,1,1,1,2,3,3,3,4,4),
                 order_status = c('f','f','t','t','f','f','t','t','t','t'))
DT
The desired counter output is as follows. The 'output' column is the counter variable.
    time user order_status output
 1:    1    1            f      0
 2:    2    1            f      0
 3:    2    1            t      1
 4:    2    1            t      1
 5:    1    2            f      0
 6:    1    3            f      0
 7:    2    3            t      1
 8:    3    3            t      2
 9:    1    4            t      1
10:    1    4            t      1
Solution
The main challenge here is to set the first occurrence of every combination of time, user, order_status == 't' to 1. Then it's a simple cumulative sum grouped by user.
Here are two ways to accomplish this using data.table:
Method 1:
DT[, id := 0L
  ][order_status == "t", id := c(1L, rep(0L, .N-1L)), by=names(DT)
  ][, id := cumsum(id), by=user]
The 2nd line here marks the first occurrence with 1 only when order_status == "t".
A heavily commented production version of mine would look something like this:
DT[, id := 0L                         # set entire id col to 0
  ][order_status == "t",              # then, where order status is true
    id := c(1L, rep(0L, .N-1L)),      # set (or update) first value to 1
    by = names(DT)                    # for every time,user,order_status
  ][, id := cumsum(id),               # then, get cumulative sum of id
    by = user]                        # for every user
Method 2: Using data.table's join+update:
DT[, id := 0L
  ][DT, id := as.integer(order_status == "t"), mult="first", on=names(DT)
  ][, id := cumsum(id), by=user]
The 2nd step here does the same as in method 1, but it directly identifies the first occurrence and updates it to 1 if order_status == "t" by performing an update on a join-based subset. You can replace the DT on the inside with unique(DT) to remove redundancy.
If I had to choose, I'd say the 1st method is more efficient, since creating a rep() for each group should be quite fast compared with a join+update. But I find the 2nd method makes the actual operation easier to understand, which I think matters more if you look at your code several weeks later.