在某些条件下如何创建等级变量? [英] How to create a rank variable under certain conditions?

查看:77
本文介绍了在某些条件下如何创建等级变量?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的数据包含时间变量和选择的品牌变量,如下所示.时间表示购物时间,所选品牌表示当时的购买品牌.

My data contain time variable and chosen brand variable as below. time indicates the shopping time and chosenbrand indicates the purchased brand at the time.

使用此数据,我想创建排名变量,如第三列,第四列等.

With this data, I would like to create rank variable as shown third column, fourth column, and so on.

品牌排名(例如brand1-brand3)应基于过去的36小时.因此,要计算第二行的购物时间为"2013-09-01 08:54:00 UTC"的等级,该等级应基于时间前36小时内的所有chosenbrand值. (第二行中的brand1不应在36小时之内)

The rank of brands (e.g., brand1 - brand3) should be based on past 36 hours. So, to calculate the rank for the second row, which has shoptime as "2013-09-01 08:54:00 UTC" the rank should be based on all chosenbrand values within 36 hours before the time. (brand1 in second row should not be in the 36 hours)

因此,rank_brand1,rank_brand2,rank_brand3,rank_bran4是我想要的变量.

Therefore, rank_brand1, rank_brand2, rank_brand3, rank_bran4,,, are my desired variables.

如果我要创建rank_brand5,也要创建rank_brand6 ...

If I want to create rank_brand5, rank_brand6 as well...

有什么简单的方法吗?

此外,如果我想按个人进行操作(如果每个客户都有多次购买的历史记录),该怎么做?

In addition, if I want to do it by individual (if each customer has several purchased history), how to do that?

数据如下,

          shoptime          chosenbrand  rank_brand1 rank_brand2 rank_brand3, ...
  2013-09-01 08:35:00 UTC      brand1          NA         NA          NA
  2013-09-01 08:54:00 UTC      brand1          1          NA          NA
  2013-09-01 09:07:00 UTC      brand2          1          2          NA
  2013-09-01 09:08:00 UTC      brand3          1          2          3
  2013-09-01 09:11:00 UTC      brand5          1          2          3
  2013-09-01 09:14:00 UTC      brand2          1          2          3
  2013-09-01 09:26:00 UTC      brand6          1          1          3
  2013-09-01 09:26:00 UTC      brand2          1          1          3
  2013-09-01 09:29:00 UTC      brand2          2          1          3
  2013-09-01 09:32:00 UTC      brand4          2          1          3

这是数据代码

dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
                           "2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
                  chosenbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2"  ,  "brand2"  ,   "brand4"   ),
                  rank_brand1 = NA,
                  rank_brand2 = NA,
                 rank_brand3 = NA,
                  stringsAsFactors = FALSE)

推荐答案

这是一个棘手的问题.下面的解决方案使用 non-equi joins 进行36小时的汇总,使用dcast()从长格式到宽格式进行整形,然后使用原始的dat进行第二次连接.可以有任意多个品牌.

This is a tricky one. The solution below uses non-equi joins to aggregate by 36 hours periods, dcast() to reshape from long to wide format, and a second join with the original dat. There can be an arbitrary number of brands.

library(data.table)
library(lubridate)

setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, shoptime) # not required, just for convenience of observers
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
      , dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")][
        dat, on = "shoptime"]

               shoptime brand1 brand2 brand3 brand5 brand6  brand
 1: 2013-09-01 08:35:00     NA     NA     NA     NA     NA brand1
 2: 2013-09-01 08:54:00      1     NA     NA     NA     NA brand1
 3: 2013-09-01 09:07:00      1     NA     NA     NA     NA brand2
 4: 2013-09-01 09:08:00      1      2     NA     NA     NA brand3
 5: 2013-09-01 09:11:00      1      2      2     NA     NA brand5
 6: 2013-09-01 09:14:00      1      2      2      2     NA brand2
 7: 2013-09-01 09:26:00      1      1      2      2     NA brand6
 8: 2013-09-01 09:26:00      1      1      2      2     NA brand2
 9: 2013-09-01 09:29:00      2      1      3      3      3 brand2
10: 2013-09-01 09:32:00      2      1      3      3      3 brand4

说明

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]]

每36小时返回汇总结果:

returns the aggregated results per 36 hours periods:

               shoptime            shoptime  brand N rank
 1: 2013-08-30 20:54:00 2013-09-01 08:54:00 brand1 1    1
 2: 2013-08-30 21:07:00 2013-09-01 09:07:00 brand1 2    1
 3: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand1 2    1
 4: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand2 1    2
 5: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand1 2    1
 6: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand2 1    2
 7: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand3 1    2
 8: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand1 2    1
 9: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand2 1    2
10: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand3 1    2
11: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand5 1    2
12: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2    1
13: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2    1
14: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1    2
15: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1    2
16: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2    1
17: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2    1
18: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1    2
19: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1    2
20: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand1 2    2
21: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand2 3    1
22: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand3 1    3
23: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand5 1    3
24: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand6 1    3
25: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand1 2    2
26: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand2 4    1
27: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand3 1    3
28: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand5 1    3
29: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand6 1    3
               shoptime            shoptime  brand N rank

然后,此中间结果从长格式更改为宽格式:

Then, this intermediate result is reshaped from long to wide format:

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
      , dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")]

              shoptime brand1 brand2 brand3 brand5 brand6
1: 2013-09-01 08:54:00      1     NA     NA     NA     NA
2: 2013-09-01 09:07:00      1     NA     NA     NA     NA
3: 2013-09-01 09:08:00      1      2     NA     NA     NA
4: 2013-09-01 09:11:00      1      2      2     NA     NA
5: 2013-09-01 09:14:00      1      2      2      2     NA
6: 2013-09-01 09:26:00      1      1      2      2     NA
7: 2013-09-01 09:29:00      2      1      3      3      3
8: 2013-09-01 09:32:00      2      1      3      3      3

与原始dat数据框的最后一个右连接完成了丢失的行和列(请参见上面的代码和结果).

The final right join with the original dat data frame completes the missing rows and columns (see code and result above).

dat <- data.frame(
  shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
               "2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
  brand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2"  ,  "brand2"  ,   "brand4"   ),
  stringsAsFactors = FALSE)

这篇关于在某些条件下如何创建等级变量?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆