Tracking a cohort over time in R

Question

I have a sample dataset of user ids and months in which a transaction was made. My goal is to calculate, month over month, how many of the original users made transactions. In other words, how many users that were new in January also made transactions in February, March, and April. How many users that were new in February made transactions in March and April, and so on.

> data
       date user_id
1  Jan 2017       1
2  Jan 2017       2
3  Jan 2017       3
4  Jan 2017       4
5  Jan 2017       5
6  Feb 2017       1
7  Feb 2017       3
8  Feb 2017       5
9  Feb 2017       7
10 Feb 2017       9
11 Mar 2017       2
12 Mar 2017       4
13 Mar 2017       6
14 Mar 2017       8
15 Mar 2017      10
16 Apr 2017       1
17 Apr 2017       3
18 Apr 2017       6
19 Apr 2017       9
20 Apr 2017      12

The output of this dataset would look something like this:

> output
    Jan Feb Mar Apr
Jan   5   3   2   2
Feb  NA   2   0   1
Mar  NA  NA   3   1
Apr  NA  NA  NA   1

So far the only way I can think of doing this is to split the dataset and then calculate the unique ids for each month that are not present in the previous months, but this method is verbose and is not suited for a large dataset with many months.

subsets <-split(data, data$date, drop=TRUE)

for (i in 1:length(subsets)) {
  assign(paste0("M", i), as.data.frame(subsets[[i]]))
}

M1_ids <- unique(M1$user_id)
M2_ids <- unique(M2$user_id)
M3_ids <- unique(M3$user_id)
M4_ids <- unique(M4$user_id)


M2_ids <- unique(setdiff(M2_ids, unique(M1_ids)))
M3_ids <- unique(setdiff(M3_ids, unique(c(M2_ids, M1_ids))))
M4_ids <- unique(setdiff(M4_ids, unique(c(M3_ids, M2_ids, M1_ids))))

Is there a way in R to come up with the above output with a shorter method using dplyr or even base R? The real data set has many years and months.

The data has the following format:

> sapply(data, class)
     date   user_id 
"yearmon" "integer" 

And here is the sample data:

> dput(data)
structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017, 
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333, 
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667, 
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25, 
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L, 
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L, 
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")


Recommended answer

Here is an example:

library(data.table)
library(zoo)
data <- structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017, 
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333, 
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667, 
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25, 
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L, 
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L, 
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")
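# Duplicate the first row so the example contains a repeated transaction
# (handled by unique() below)
data <- data[c(1, 1:nrow(data)), ]
setDT(data)
# Drop duplicate rows, assign each user the first month in which they appear
# (their cohort), then count users per cohort (rows) and month (columns)
(cohorts <- dcast(unique(data)[, cohort := min(date), by = user_id], cohort ~ date,
                  fun.aggregate = length, value.var = "user_id"))
# Note: month labels follow the system locale; "Mrz" is the German abbreviation for March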
#      cohort Jan 2017 Feb 2017 Mrz 2017 Apr 2017
# 1: Jan 2017        5        3        2        2
# 2: Feb 2017        0        2        0        1
# 3: Mrz 2017        0        0        3        1
# 4: Apr 2017        0        0        0        1

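# Convert the counts to a matrix with the cohorts as row names
m <- as.matrix(cohorts[,-1])
rownames(m) <- cohorts[[1]]
# Blank out cells below the diagonal: months before a cohort's first transaction
m[lower.tri(m)] <- NA
names(dimnames(m)) <- c("cohort", "yearmon")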
m
#           yearmon
# cohort     Jan 2017 Feb 2017 Mrz 2017 Apr 2017
#   Jan 2017        5        3        2        2
#   Feb 2017       NA        2        0        1
#   Mrz 2017       NA       NA        3        1
#   Apr 2017       NA       NA       NA        1
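
Since the question also asks about base R, the same idea (assign each user to the month of their first transaction, then cross-tabulate cohort against month) can be sketched without data.table. This is only an illustration under a few assumptions: the months are stored as Date values (first day of each month) rather than yearmon so that no packages are needed, and the variable names first_seen, months and retention are made up for this example.

```r
# Sample data from the question, with months stored as Date (first of month)
# instead of zoo::yearmon so that no extra packages are needed (an assumption).
data <- data.frame(
  date = as.Date(rep(c("2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01"),
                     each = 5)),
  user_id = c(1, 2, 3, 4, 5, 1, 3, 5, 7, 9, 2, 4, 6, 8, 10, 1, 3, 6, 9, 12)
)

# Keep one row per user and month so repeated transactions are not double counted
data <- unique(data)

# Cohort = first month in which each user appears
first_seen <- ave(as.numeric(data$date), data$user_id, FUN = min)
data$cohort <- as.Date(first_seen, origin = "1970-01-01")

# Use the same month levels for rows and columns so the table stays square
# even if some month has no new users
months <- format(sort(unique(data$date)), "%Y-%m")
m <- table(
  cohort = factor(format(data$cohort, "%Y-%m"), levels = months),
  month  = factor(format(data$date, "%Y-%m"), levels = months)
)
m <- unclass(m)            # plain integer matrix with dimnames
m[lower.tri(m)] <- NA      # a cohort cannot transact before it exists
m

# Optional: share of each cohort still active, relative to its first month.
# Rows for months without new users would be divided by zero.
retention <- round(100 * m / diag(m))
```

Fixing the factor levels is what makes the lower.tri() mask safe here: without it, a month with no new users would drop out of the rows and the matrix would no longer be square.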
