从数据帧中每个索引的随机开始日期开始采样n个连续日期 [英] Sample n consecutive dates from a random starting date for each index in a data frame

查看:43
本文介绍了从数据帧中每个索引的随机开始日期开始采样n个连续日期的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

考虑以下DataFrame:

Consider the following DataFrame:

    DF = structure(list(c_number = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L), date = c("2001-01-06", "2001-01-07", "2001-01-08", 
"2001-01-09", "2001-01-10", "2001-01-11", "2001-01-12", "2001-01-13", 
"2001-01-14", "2001-01-15", "2001-01-16", "2001-01-17", "2001-01-18", 
"2001-01-19", "2001-01-20", "2001-01-21", "2001-01-22", "2001-01-23", 
"2001-01-24", "2001-01-25", "2001-01-26", "2001-01-11", "2001-01-12", 
"2001-01-13", "2001-01-14", "2001-01-15", "2001-01-16", "2001-01-17", 
"2001-01-18", "2001-01-19", "2001-01-20", "2001-01-21", "2001-01-22", 
"2001-01-23", "2001-01-24", "2001-01-25", "2001-01-26", "2001-01-27", 
"2001-01-28", "2001-01-12", "2001-01-13", "2001-01-14", "2001-01-15", 
"2001-01-16", "2001-01-17", "2001-01-18", "2001-01-19", "2001-01-20", 
"2001-01-21", "2001-01-22", "2001-01-23", "2001-01-24", "2001-01-25", 
"2001-01-26", "2001-01-27", "2001-01-28", "2001-01-29", "2001-01-30", 
"2001-01-21", "2001-01-22", "2001-01-23", "2001-01-24", "2001-01-25", 
"2001-01-26", "2001-01-27", "2001-01-28", "2001-01-29", "2001-01-30", 
"2001-01-31", "2001-01-24", "2001-01-25", "2001-01-26", "2001-01-27", 
"2001-01-28", "2001-01-29", "2001-01-30", "2001-01-31", "2001-02-01"
), value = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("c_number", 
"date", "value"), row.names = c(NA, -78L), class = "data.frame")

我有连续5个客户的销售数据;对于客户1,我有21个连续日期的销售数据....对于客户#5,我有9个连续日期的销售数据...:

I have sales data for 5 customer on consecutive dates; For customer 1, I have sales data on 21 consecutive dates....for customer # 5, I have sales data on 9 consecutive dates...:

> table(DF[, 1])

 1  2  3  4  5 
21 18 19 11  9

对于每个客户,我想对连续15天(如果我有至少15个连续日期的该客户)或该客户的所有日期(如果我没有15个连续的日期)的子DF进行抽样.

For each customer I want to sample a sub DF of 15 consecutive days (If I have at least 15 consecutive dates for that customer) or all dates for that customer (if I don't have 15 consecutive dates for that customer).

关键部分是,在情况1(如果我有至少15个连续日期为该客户)的情况下,这15个连续天应具有一个随机的开始日期(例如,并非始终是客户的第一个或最后15个日期)避免在分析中引入偏见.

The key part is that in case 1 (If I have at least 15 consecutive dates for that customer) those 15 consecutive days should have a random start date (e.g. not always be the first or last 15 dates for an customer) to avoid introducing a bias in the analysis.

在普通R中,我会这样做:

In plain R I would do:

library(dplyr)

slow_function <- function(i, DF, length_out = 15){
  sub_DF = DF[DF$c_number == i, ]
  if(nrow(sub_DF) <= length_out){
    out_DF = sub_DF
  } else {
    random_start = sample.int(nrow(sub_DF) - length_out, 1)
    out_DF = sub_DF[random_start:(random_start + length_out - 1), ]
  }
}
a_out = lapply(1:nrow(a_1), slow_function, DF = DF, length_out = 15)
a_out = dplyr::bind_rows(a_out)


table(a_out[, 1])
 1  2  3  4  5 
15 15 15 11  9 

但是我的数据要大得多,并且上述操作速度慢得令人难以忍受.有没有一种快速的方法可以在data.table/dplyr中获得相同的结果?

But my data is much larger and the operation above unbearably slow. Is there a fast way to obtain the same result in data.table/dplyr?

num_customer = 10
m   = 2 * num_customer
a_0 = seq(as.Date("2001-01-01"), as.Date("2001-12-31"), by = "day")
a_1 = matrix(sort(sample(as.character(a_0), m)), nc = 2)
a_2 = list()
for(i in 1:nrow(a_1)){
  a_3 = seq(as.Date(a_1[i, 1]), as.Date(a_1[i, 2]), by = "day")
  a_4 = data.frame(i, as.character(a_3), round(runif(length(a_3), 1)))
  colnames(a_4) = c("c_number", "date", "value")
  a_2[[i]] = a_4
}
DF = dplyr::bind_rows(a_2)
dim(DF)
table(DF[, 1])
dput(DF)

Edit2:

在10万客户DF上,Christoph Wolk的解决方案是最快的.接下来是G. Grothendieck的(大约四倍的时间),接下来是内森·沃斯(Nathan Werth)(比格罗腾迪克(G. Grothendieck)慢2倍).其他解决方案明显较慢.尽管如此,所有提案都比我尝试性的"slow_function"要快,所以谢谢大家!

on a 100k customer DF, Christoph Wolk's solution is the fastest. Next is G. Grothendieck's (about 4 times more time), next is Nathan Werth's (another factor of 2 slower than G. Grothendieck's). The other solutions are noticeably slower. Still, all proposals are faster than my tentative 'slow_function' so thanks to everyone!

推荐答案

尝试一下:

sample15consecutive <- function(DF) {
runs <- rle(DF$c_number)$lengths
start <- ifelse(runs > 15, sapply(pmax(runs-15, 1), sample.int, size=1), 1)
end <- ifelse(runs >= 15, 15, runs)
previous <- cumsum(c(0, head(runs, -1)))
DF[unlist(mapply(seq, previous + start, previous + start + end - 1), length),]
}

根据微基准测试,速度大约提高了4倍.c_number和日期必须排序.

It's about 4 times faster according to microbenchmark. The c_numbers and dates have to be sorted.

这篇关于从数据帧中每个索引的随机开始日期开始采样n个连续日期的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆