如何在特定条件下创建排名变量? [英] How to create a rank variable under certain conditions?
问题描述
我的数据包含时间变量和选择的品牌变量,如下所示.time表示购物时间,choicebrand表示当时购买的品牌.
使用这些数据,我想创建排名变量,如第三列、第四列等所示.
品牌排名(例如,brand1 - brand3)应基于过去 36 小时.因此,要计算第二行的排名,其中 shoptime 为 "2013-09-01 08:54:00 UTC"
,排名应基于所有 chosenbrand
时间前 36 小时内的值.(第二行的brand1
不应在36小时内)
因此,rank_brand1, rank_brand2, rank_brand3, rank_bran4,,, 是我想要的变量.
如果我想创建 rank_brand5,rank_brand6 也是如此......
有什么简单的方法吗?
另外,如果我要个人做(如果每个客户都有多个购买历史),该怎么做?
数据如下,
shoptime selectedbrand rank_brand1 rank_brand2 rank_brand3, ...2013-09-01 08:35:00 UTC 品牌1 NA NA NA2013-09-01 08:54:00 UTC 品牌1 1 NA NA2013-09-01 09:07:00 UTC 品牌2 1 2 NA2013-09-01 09:08:00 UTC 品牌3 1 2 32013-09-01 09:11:00 UTC 品牌5 1 2 32013-09-01 09:14:00 UTC 品牌2 1 2 32013-09-01 09:26:00 UTC 品牌6 1 1 32013-09-01 09:26:00 UTC 品牌2 1 1 32013-09-01 09:29:00 UTC 品牌2 2 1 32013-09-01 09:32:00 UTC 品牌4 2 1 3
这是数据代码
dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC"、2013-09-01 09:08:00 UTC"、2013-09-01 09:11:00 UTC"、2013-09-01 09:14:00 UTC"",2013-09-01 09:26:00 UTC"、2013-09-01 09:26:00 UTC"、2013-09-01 09:29:00 UTC"、2013-09-01 09:世界标准时间 32:00"),选择品牌= c(品牌1",品牌1",品牌2",品牌3",品牌5",品牌2",品牌6",品牌2",品牌2",品牌4"),rank_brand1 = 不适用,rank_brand2 = 不适用,rank_brand3 = 不适用,字符串AsFactors = FALSE)
这是一个棘手的问题.下面的解决方案使用 non-equi 连接 以 36 小时为周期聚合,dcast()
将长格式重塑为宽格式,并使用原始 进行第二次连接数据
.可以有任意数量的品牌.
库(data.table)图书馆(润滑)setDT(dat)[, shoptime := as_datetime(shoptime)]setorder(dat, shoptime) # 不需要,只是为了方便观察者dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),nomatch = 0L,按 = .EACHI,.SD[, .N, by = brand][, rank := frank(-N, ties.method=dense")]][, dcast(unique(.SD[, -1]), shoptime ~ 品牌, value.var = "rank")][dat, on = "shoptime"]
<块引用>
shoptime 品牌1 品牌2 品牌3 品牌5 品牌6 品牌1: 2013-09-01 08:35:00 NA NA NA NA NA 品牌12: 2013-09-01 08:54:00 1 NA NA NA NA 品牌13: 2013-09-01 09:07:00 1 NA NA NA NA 品牌24: 2013-09-01 09:08:00 1 2 NA NA NA 品牌35: 2013-09-01 09:11:00 1 2 2 NA NA 品牌56: 2013-09-01 09:14:00 1 2 2 2 NA 品牌27: 2013-09-01 09:26:00 1 1 2 2 NA 品牌68: 2013-09-01 09:26:00 1 1 2 2 NA 品牌29: 2013-09-01 09:29:00 2 1 3 3 3 品牌210: 2013-09-01 09:32:00 2 1 3 3 3 品牌4
说明
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),nomatch = 0L,按 = .EACHI,.SD[, .N, by = brand][, rank := frank(-N, ties.method=dense")]]
返回每 36 小时的汇总结果:
<块引用> shoptime shoptime 品牌 N 等级1: 2013-08-30 20:54:00 2013-09-01 08:54:00 品牌1 1 12: 2013-08-30 21:07:00 2013-09-01 09:07:00 品牌1 2 13: 2013-08-30 21:08:00 2013-09-01 09:08:00 品牌1 2 14: 2013-08-30 21:08:00 2013-09-01 09:08:00 品牌2 1 25: 2013-08-30 21:11:00 2013-09-01 09:11:00 品牌1 2 16: 2013-08-30 21:11:00 2013-09-01 09:11:00 品牌2 1 27: 2013-08-30 21:11:00 2013-09-01 09:11:00 品牌3 1 28: 2013-08-30 21:14:00 2013-09-01 09:14:00 品牌1 2 19: 2013-08-30 21:14:00 2013-09-01 09:14:00 品牌2 1 210: 2013-08-30 21:14:00 2013-09-01 09:14:00 品牌3 1 211: 2013-08-30 21:14:00 2013-09-01 09:14:00 品牌5 1 212: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌1 2 113: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌2 2 114: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌3 1 215: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌5 1 216: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌1 2 117: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌2 2 118: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌3 1 219: 2013-08-30 21:26:00 2013-09-01 09:26:00 品牌5 1 220: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌1 2 221: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌2 3 122: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌3 1 323: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌5 1 324: 2013-08-30 21:29:00 2013-09-01 09:29:00 品牌6 1 325: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌1 2 226: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌2 4 127: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌3 1 328: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌5 1 329: 2013-08-30 21:32:00 2013-09-01 09:32:00 品牌6 1 3shoptime shoptime品牌N等级
然后,这个中间结果从长格式改成宽格式:
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),nomatch = 0L,按 = .EACHI,.SD[, .N, by = brand][, rank := frank(-N, ties.method=dense")]][, dcast(unique(.SD[, -1]), shoptime ~ 品牌, value.var = "rank")]
<块引用>
shoptime 品牌1 品牌2 品牌3 品牌5 品牌61: 2013-09-01 08:54:00 1 NA NA NA NA2: 2013-09-01 09:07:00 1 NA NA NA NA3: 2013-09-01 09:08:00 1 2 不适用 不适用 不适用4: 2013-09-01 09:11:00 1 2 2 NA NA5: 2013-09-01 09:14:00 1 2 2 2 不适用6: 2013-09-01 09:26:00 1 1 2 2 不适用7: 2013-09-01 09:29:00 2 1 3 3 38: 2013-09-01 09:32:00 2 1 3 3 3
与原始 dat
数据框的最终右连接补全缺失的行和列(参见上面的代码和结果).
数据
dat <- data.frame(shoptime = c(2013-09-01 08:35:00 UTC", 2013-09-01 08:54:00 UTC", 2013-09-01 09:07:00 UTC",2013-09-01 09:08:00 UTC"、2013-09-01 09:11:00 UTC"、2013-09-01 09:14:00 UTC"、2013-09-01 09:26:00 UTC"、2013-09-01 09:26:00 UTC","2013-09-01 09:29:00 UTC", 2013-09-01 09:32:00 UTC"),品牌 = c(品牌 1"、品牌 1"、品牌 2"、品牌 3"、品牌 5"、品牌 2"、品牌 6"、品牌 2"、品牌 2"、品牌;brand4"),字符串AsFactors = FALSE)
My data contain time variable and chosen brand variable as below. time indicates the shopping time and chosenbrand indicates the purchased brand at the time.
With this data, I would like to create rank variable as shown third column, fourth column, and so on.
The rank of brands (e.g., brand1 - brand3) should be based on past 36 hours. So, to calculate the rank for the second row, which has shoptime as "2013-09-01 08:54:00 UTC"
the rank should be based on all chosenbrand
values within 36 hours before the time. (brand1
in second row should not be in the 36 hours)
Therefore, rank_brand1, rank_brand2, rank_brand3, rank_bran4,,, are my desired variables.
If I want to create rank_brand5, rank_brand6 as well...
Is there any simple way?
In addition, if I want to do it by individual (if each customer has several purchased history), how to do that?
Data is as below,
shoptime chosenbrand rank_brand1 rank_brand2 rank_brand3, ...
2013-09-01 08:35:00 UTC brand1 NA NA NA
2013-09-01 08:54:00 UTC brand1 1 NA NA
2013-09-01 09:07:00 UTC brand2 1 2 NA
2013-09-01 09:08:00 UTC brand3 1 2 3
2013-09-01 09:11:00 UTC brand5 1 2 3
2013-09-01 09:14:00 UTC brand2 1 2 3
2013-09-01 09:26:00 UTC brand6 1 1 3
2013-09-01 09:26:00 UTC brand2 1 1 3
2013-09-01 09:29:00 UTC brand2 2 1 3
2013-09-01 09:32:00 UTC brand4 2 1 3
Here is code for data
dat <- data.frame(shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
"2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
chosenbrand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
rank_brand1 = NA,
rank_brand2 = NA,
rank_brand3 = NA,
stringsAsFactors = FALSE)
This is a tricky one. The solution below uses non-equi joins to aggregate by 36 hours periods, dcast()
to reshape from long to wide format, and a second join with the original dat
. There can be an arbitrary number of brands.
library(data.table)
library(lubridate)
setDT(dat)[, shoptime := as_datetime(shoptime)]
setorder(dat, shoptime) # not required, just for convenience of observers
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),
nomatch = 0L, by = .EACHI,
.SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
, dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")][
dat, on = "shoptime"]
shoptime brand1 brand2 brand3 brand5 brand6 brand 1: 2013-09-01 08:35:00 NA NA NA NA NA brand1 2: 2013-09-01 08:54:00 1 NA NA NA NA brand1 3: 2013-09-01 09:07:00 1 NA NA NA NA brand2 4: 2013-09-01 09:08:00 1 2 NA NA NA brand3 5: 2013-09-01 09:11:00 1 2 2 NA NA brand5 6: 2013-09-01 09:14:00 1 2 2 2 NA brand2 7: 2013-09-01 09:26:00 1 1 2 2 NA brand6 8: 2013-09-01 09:26:00 1 1 2 2 NA brand2 9: 2013-09-01 09:29:00 2 1 3 3 3 brand2 10: 2013-09-01 09:32:00 2 1 3 3 3 brand4
Explanation
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),
nomatch = 0L, by = .EACHI,
.SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]]
returns the aggregated results per 36 hours periods:
shoptime shoptime brand N rank 1: 2013-08-30 20:54:00 2013-09-01 08:54:00 brand1 1 1 2: 2013-08-30 21:07:00 2013-09-01 09:07:00 brand1 2 1 3: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand1 2 1 4: 2013-08-30 21:08:00 2013-09-01 09:08:00 brand2 1 2 5: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand1 2 1 6: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand2 1 2 7: 2013-08-30 21:11:00 2013-09-01 09:11:00 brand3 1 2 8: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand1 2 1 9: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand2 1 2 10: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand3 1 2 11: 2013-08-30 21:14:00 2013-09-01 09:14:00 brand5 1 2 12: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2 1 13: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2 1 14: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1 2 15: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1 2 16: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand1 2 1 17: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand2 2 1 18: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand3 1 2 19: 2013-08-30 21:26:00 2013-09-01 09:26:00 brand5 1 2 20: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand1 2 2 21: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand2 3 1 22: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand3 1 3 23: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand5 1 3 24: 2013-08-30 21:29:00 2013-09-01 09:29:00 brand6 1 3 25: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand1 2 2 26: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand2 4 1 27: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand3 1 3 28: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand5 1 3 29: 2013-08-30 21:32:00 2013-09-01 09:32:00 brand6 1 3 shoptime shoptime brand N rank
Then, this intermediate result is reshaped from long to wide format:
dat[.(lb = shoptime - hours(36), ub = shoptime), on = .(shoptime >= lb, shoptime < ub),
nomatch = 0L, by = .EACHI,
.SD[, .N, by = brand][, rank := frank(-N, ties.method="dense")]][
, dcast(unique(.SD[, -1]), shoptime ~ brand, value.var = "rank")]
shoptime brand1 brand2 brand3 brand5 brand6 1: 2013-09-01 08:54:00 1 NA NA NA NA 2: 2013-09-01 09:07:00 1 NA NA NA NA 3: 2013-09-01 09:08:00 1 2 NA NA NA 4: 2013-09-01 09:11:00 1 2 2 NA NA 5: 2013-09-01 09:14:00 1 2 2 2 NA 6: 2013-09-01 09:26:00 1 1 2 2 NA 7: 2013-09-01 09:29:00 2 1 3 3 3 8: 2013-09-01 09:32:00 2 1 3 3 3
The final right join with the original dat
data frame completes the missing rows and columns (see code and result above).
Data
dat <- data.frame(
shoptime = c("2013-09-01 08:35:00 UTC", "2013-09-01 08:54:00 UTC", "2013-09-01 09:07:00 UTC" ,"2013-09-01 09:08:00 UTC", "2013-09-01 09:11:00 UTC", "2013-09-01 09:14:00 UTC",
"2013-09-01 09:26:00 UTC", "2013-09-01 09:26:00 UTC" ,"2013-09-01 09:29:00 UTC", "2013-09-01 09:32:00 UTC"),
brand = c("brand1", "brand1", "brand2", "brand3", "brand5", "brand2", "brand6", "brand2" , "brand2" , "brand4" ),
stringsAsFactors = FALSE)
这篇关于如何在特定条件下创建排名变量?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!