如何在特定条件下创建等级变量? [英] How to create a rank variable under certain conditions?

查看:20
本文介绍了如何在特定条件下创建等级变量?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的数据包含时间变量和选择的品牌变量,如下所示.time 表示购物时间,choosebrand 表示当时购买的品牌.

使用此数据,我想创建排名变量,如第三列、第四列等所示.

品牌排名(例如,品牌 1 - 品牌 3)应基于过去 36 小时.因此,要计算商店时间为 "2013-09-01 08:54:00 UTC" 的第二行的排名,排名应基于所有 chosenbrand时间前 36 小时内的值.(第二排的brand1不应在36小时内)

因此,rank_brand1、rank_brand2、rank_brand3、rank_bran4,,,是我想要的变量.

如果我想创建 rank_brand5、rank_brand6 以及...

有什么简单的方法吗?

另外,如果我想一个人做(如果每个客户都有好几个购买记录),怎么做?

数据如下,

 shoptime selectedbrand rank_brand1 rank_brand2 rank_brand3, ...2013-09-01 08:35:00 UTC 品牌1 NA NA NA2013-09-01 08:54:00 UTC 品牌1 1 NA NA2013-09-01 09:07:00 UTC 品牌2 1 2 NA2013-09-01 09:08:00 UTC 品牌3 1 2 32013-09-01 09:11:00 UTC 品牌5 1 2 32013-09-01 09:14:00 UTC 品牌2 1 2 32013-09-01 09:26:00 UTC 品牌6 1 1 32013-09-01 09:26:00 UTC 品牌2 1 1 32013-09-01 09:29:00 UTC 品牌2 2 1 32013-09-01 09:32:00 UTC 品牌4 2 1 3

这里是数据的代码

dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC"、2013-09-01 09:08:00 UTC"、2013-09-01 09:11:00 UTC"、2013-09-01 09:14:00 UTC"","2013-09-01 09:26:00 UTC"、"2013-09-01 09:26:00 UTC"、"2013-09-01 09:29:00 UTC"、"2013-09-01 09:32:00 UTC"),selectedbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2", "brand2", "brand4" ),rank_brand1 = 不适用,rank_brand2 = 不适用,rank_brand3 = 不适用,stringsAsFactors = FALSE)

解决方案

这是一个棘手的问题.下面的解决方案使用 non-equi joins 以 36 小时为单位聚合,dcast() 将长格式重塑为宽格式,然后与原始 进行第二次连接数据.可以有任意数量的品牌.

library(data.table)图书馆(润滑)setDT(dat)[, shoptime := as_datetime(shoptime)]setorder(dat, shoptime) # 不需要,只是为了方便观察者dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime 

<块引用>

 shoptime 品牌1 品牌2 品牌3 品牌5 品牌6 品牌1: 2013-09-01 08:35:00 NA NA NA NA NA 品牌12:2013-09-01 08:54:00 1 NA NA NA NA 品牌13:2013-09-01 09:07:00 1 NA NA NA NA 品牌24: 2013-09-01 09:08:00 1 2 NA NA NA 品牌35: 2013-09-01 09:11:00 1 2 2 NA NA 品牌56: 2013-09-01 09:14:00 1 2 2 2 北美品牌27: 2013-09-01 09:26:00 1 1 2 2 北美品牌68: 2013-09-01 09:26:00 1 1 2 2 北美品牌29: 2013-09-01 09:29:00 2 1 3 3 3 品牌210: 2013-09-01 09:32:00 2 1 3 3 3 品牌4

说明

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime 

返回每 36 小时的汇总结果:

<块引用>

 shoptime shoptime 品牌N等级1: 2013-08-30 20:54:00 2013-09-01 08:54:00 品牌1 1 12: 2013-08-30 21:07:00 2013-09-01 09:07:00 品牌1 2 13: 2013-08-30 21:08:00 2013-09-01 09:08:00 品牌1 2 14: 2013-08-30 21:08:00 2013-09-01 09:08:00 品牌2 1 25: 2013-08-30 21:11:00 2013-09-01 09:11:00 品牌1 2 16: 2013-08-30 21:11:00 2013-09-01 09:11:00 品牌2 1 27: 2013-08-30 21:11:00 2013-09-01 09:11:00 品牌3 1 28: 2013-08-30 21:14:00 2013-09-01 09:14:00 品牌1 2 19: 2013-08-30 21:14:00 2013-09-01 09:14:00 品牌2 1 210: 2013-08-30 21:14:00 2013-09-01 09:14:00 品牌3 1 211: 2013-08-30 21:14:00 2013-09-01 09:14:00 品牌5 1 212: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌1 2 113: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌2 2 114: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌3 1 215: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌5 1 216: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌1 2 117: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌2 2 118: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌3 1 219: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌5 1 220: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌1 2 221: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌2 3 122: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌3 1 323: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌5 1 324: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌6 1 325: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌1 2 226: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌2 4 127: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌3 1 328: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌5 1 329: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌6 1 3shoptime shoptime 品牌 N 级

然后,将这个中间结果从长格式改成宽格式:

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime 

<块引用>

 shoptime 品牌1 品牌2 品牌3 品牌5 品牌61: 2013-09-01 08:54:00 1 NA NA NA NA2: 2013-09-01 09:07:00 1 NA NA NA NA3: 2013-09-01 09:08:00 1 2 不适用 不适用4: 2013-09-01 09:11:00 1 2 2 不适用 不适用5: 2013-09-01 09:14:00 1 2 2 2 不适用6: 2013-09-01 09:26:00 1 1 2 2 不适用7: 2013-09-01 09:29:00 2 1 3 3 38: 2013-09-01 09:32:00 2 1 3 3 3

与原始 dat 数据框的最终右连接完成缺失的行和列(参见上面的代码和结果).

数据

dat <- data.frame(shoptime = c(2013-09-01 08:35:00 UTC"、2013-09-01 08:54:00 UTC"、2013-09-01 09:07:00 UTC"、";2013-09-01 09:08:00 UTC"、2013-09-01 09:11:00 UTC"、2013-09-01 09:14:00 UTC"、2013-09-01 09:26:00 UTC"、2013-09-01 09:26:00 UTC",2013-09-01 09:29:00 UTC"、2013-09-01 09:32:00 UTC"),品牌 = c(brand1"、brand1"、brand2"、brand3"、brand5"、brand2"、brand6"、brand2"、brand2"、";品牌4"),stringsAsFactors = FALSE)

My data contain time variable and chosen brand variable as below. time indicates the shopping time and chosenbrand indicates the purchased brand at the time.

With this data, I would like to create rank variable as shown third column, fourth column, and so on.

The rank of brands (e.g., brand1 - brand3) should be based on past 36 hours. So, to calculate the rank for the second row, which has shoptime as "2013-09-01 08:54:00 UTC" the rank should be based on all chosenbrand values within 36 hours before the time. (brand1 in second row should not be in the 36 hours)

Therefore, rank_brand1, rank_brand2, rank_brand3, rank_bran4,,, are my desired variables.

If I want to create rank_brand5, rank_brand6 as well...

Is there any simple way?

In addition, if I want to do it by individual (if each customer has several purchased history), how to do that?

Data is as below,

          shoptime          chosenbrand  rank_brand1 rank_brand2 rank_brand3, ...
  2013-09-01 08:35:00 UTC      brand1          NA         NA          NA
  2013-09-01 08:54:00 UTC      brand1          1          NA          NA
  2013-09-01 09:07:00 UTC      brand2          1          2          NA
  2013-09-01 09:08:00 UTC      brand3          1          2          3
  2013-09-01 09:11:00 UTC      brand5          1          2          3
  2013-09-01 09:14:00 UTC      brand2          1          2          3
  2013-09-01 09:26:00 UTC      brand6          1          1          3
  2013-09-01 09:26:00 UTC      brand2          1          1          3
  2013-09-01 09:29:00 UTC      brand2          2          1          3
  2013-09-01 09:32:00 UTC      brand4          2          1          3

Here is code for data

dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
                           "2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
                  chosenbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2"  ,  "brand2"  ,   "brand4"   ),
                  rank_brand1 = NA,
                  rank_brand2 = NA,
                 rank_brand3 = NA,
                  stringsAsFactors = FALSE)

解决方案

This is a tricky one. The solution below uses non-equi joins to aggregate by 36 hours periods, dcast() to reshape from long to wide format, and a second join with the original dat. There can be an arbitrary number of brands.

library(data.table)
library(lubridate)

setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, shoptime) # not required, just for convenience of observers
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
      , dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")][
        dat, on = "shoptime"]

               shoptime brand1 brand2 brand3 brand5 brand6  brand
 1: 2013-09-01 08:35:00     NA     NA     NA     NA     NA brand1
 2: 2013-09-01 08:54:00      1     NA     NA     NA     NA brand1
 3: 2013-09-01 09:07:00      1     NA     NA     NA     NA brand2
 4: 2013-09-01 09:08:00      1      2     NA     NA     NA brand3
 5: 2013-09-01 09:11:00      1      2      2     NA     NA brand5
 6: 2013-09-01 09:14:00      1      2      2      2     NA brand2
 7: 2013-09-01 09:26:00      1      1      2      2     NA brand6
 8: 2013-09-01 09:26:00      1      1      2      2     NA brand2
 9: 2013-09-01 09:29:00      2      1      3      3      3 brand2
10: 2013-09-01 09:32:00      2      1      3      3      3 brand4

Explanation

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]]

returns the aggregated results per 36 hours periods:

               shoptime            shoptime  brand N rank
 1: 2013-08-30 20:54:00 2013-09-01 08:54:00 brand1 1    1
 2: 2013-08-30 21:07:00 2013-09-01 09:07:00 brand1 2    1
 3: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand1 2    1
 4: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand2 1    2
 5: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand1 2    1
 6: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand2 1    2
 7: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand3 1    2
 8: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand1 2    1
 9: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand2 1    2
10: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand3 1    2
11: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand5 1    2
12: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2    1
13: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2    1
14: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1    2
15: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1    2
16: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2    1
17: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2    1
18: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1    2
19: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1    2
20: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand1 2    2
21: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand2 3    1
22: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand3 1    3
23: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand5 1    3
24: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand6 1    3
25: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand1 2    2
26: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand2 4    1
27: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand3 1    3
28: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand5 1    3
29: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand6 1    3
               shoptime            shoptime  brand N rank

Then, this intermediate result is reshaped from long to wide format:

dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub), 
    nomatch = 0L, by = .EACHI, 
    .SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
      , dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")]

              shoptime brand1 brand2 brand3 brand5 brand6
1: 2013-09-01 08:54:00      1     NA     NA     NA     NA
2: 2013-09-01 09:07:00      1     NA     NA     NA     NA
3: 2013-09-01 09:08:00      1      2     NA     NA     NA
4: 2013-09-01 09:11:00      1      2      2     NA     NA
5: 2013-09-01 09:14:00      1      2      2      2     NA
6: 2013-09-01 09:26:00      1      1      2      2     NA
7: 2013-09-01 09:29:00      2      1      3      3      3
8: 2013-09-01 09:32:00      2      1      3      3      3

The final right join with the original dat data frame completes the missing rows and columns (see code and result above).

Data

dat <- data.frame(
  shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
               "2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
  brand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2"  ,  "brand2"  ,   "brand4"   ),
  stringsAsFactors = FALSE)

这篇关于如何在特定条件下创建等级变量?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆