Create cumulative counter variable per-user, with multiple conditions


Problem description


I need to create a counter variable depending on three other variables.

This is an extension of an earlier question. Consider multiple consumers placing orders on Amazon. I want to count the successful orders for each user. If an order is placed successfully, the counter is incremented by one; if the order fails, the counter stays the same. The counter therefore depends on the time, the order status, and the user.

Please also consider the case where the time is the same but the order status is different. That does not make the rows duplicates: they have other columns that differ.

library(data.table)
DT <- data.table(time=c(1,2,2,2,1,1,2,3,1,1),user=c(1,1,1,1,2,3,3,3,4,4), order_status=c('f','f','t','t','f','f','t','t','t','t'))
DT

The desired output is as follows. The 'output' column is the counter variable.

    time user order_status output
 1:    1    1            f      0
 2:    2    1            f      0
 3:    2    1            t      1
 4:    2    1            t      1
 5:    1    2            f      0
 6:    1    3            f      0
 7:    2    3            t      1
 8:    3    3            t      2
 9:    1    4            t      1
10:    1    4            t      1

Solution

The main challenge here is to set the first occurrence of every combination of time, user, order_status=='t' to 1. Then it's a simple cumulative sum grouped by user.
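To see the two steps separately, here is a minimal sketch that keeps the intermediate flag in its own column. It uses data.table's duplicated() method rather than either of the methods in this answer, and the column names first_success and output are my own choices, not from the question.

```r
library(data.table)

DT <- data.table(
  time         = c(1, 2, 2, 2, 1, 1, 2, 3, 1, 1),
  user         = c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4),
  order_status = c('f', 'f', 't', 't', 'f', 'f', 't', 't', 't', 't')
)

# Step 1: flag the first row of every (time, user, order_status)
# combination, but only where the order succeeded.
key_cols <- c("time", "user", "order_status")
is_first <- !duplicated(DT, by = key_cols)
DT[, first_success := as.integer(order_status == "t" & is_first)]

# Step 2: running total of the flags within each user.
DT[, output := cumsum(first_success), by = user]

print(DT)
```

On the sample data the output column should match the desired counter from the question.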

Here are two ways to accomplish this using data.table:

Method 1:

DT[, id := 0L
  ][order_status == "t", id := c(1L, rep(0L, .N-1L)), by=names(DT)
   ][, id := cumsum(id), by=user]

The second line marks the first occurrence in each group with 1, and only for rows where order_status == "t".

In production code I would comment this heavily, something like this:

DT[, id := 0L                       # set entire id col to 0
  ][order_status == "t",            # then, where order status is true
      id := c(1L, rep(0L, .N-1L)),  # set (or update) first value to 1
      by = names(DT)                # for every time,user,order_status
   ][, id := cumsum(id),            # then, get cumulative sum of id
       by = user]                   # for every user
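One caveat, offered as a sketch rather than a definitive fix: the question says real rows have other columns that differ, and by = names(DT) groups by every column, so such a column would put rows 3 and 4 into different groups. Naming the three grouping columns explicitly avoids that. The item column below is hypothetical, added only to illustrate the point, and I keep the flag (id) and the running total (output) in separate columns so the intermediate values stay visible.

```r
library(data.table)

DT <- data.table(
  time         = c(1, 2, 2, 2, 1, 1, 2, 3, 1, 1),
  user         = c(1, 1, 1, 1, 2, 3, 3, 3, 4, 4),
  order_status = c('f', 'f', 't', 't', 'f', 'f', 't', 't', 't', 't'),
  item         = letters[1:10]  # hypothetical extra column, differs on every row
)

# Start with every flag at 0.
DT[, id := 0L]

# Within each (time, user, order_status) group of successful orders,
# set the first row to 1 and the remaining .N - 1 rows to 0.
DT[order_status == "t",
   id := c(1L, rep(0L, .N - 1L)),
   by = .(time, user, order_status)]

# Running total of the flags for each user.
DT[, output := cumsum(id), by = user]

print(DT)
```

Because item is not part of the grouping, rows 3 and 4 still share one group and only the first of them is flagged.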


Method 2: Using data.table's join+update:

DT[, id := 0L
  ][DT, id := as.integer(order_status == "t"), mult="first", on=names(DT)
   ][, id := cumsum(id), by=user]

The second step does the same thing as in method 1, but it identifies the first occurrence directly: it performs an update on a join-based subset, setting the first matching row (mult="first") to 1 if order_status == "t". You can replace the inner DT with unique(DT) to remove redundancy.

If I had to pick, I'd say the first method is more efficient, since creating a rep() for each group should be quite fast compared with a join+update. But I find the second method makes it easier to see what the actual operation is, which I think matters more when you come back to your code several weeks later.
