How do I do a conditional sum which only looks between certain date criteria
Problem description
Say I have data that looks like this:
date, user, items_bought, event_number
2013-01-01, x, 2, 1
2013-01-02, x, 1, 2
2013-01-03, x, 0, 3
2013-01-04, x, 0, 4
2013-01-04, x, 1, 5
2013-01-04, x, 2, 6
2013-01-05, x, 3, 7
2013-01-06, x, 1, 8
2013-01-01, y, 1, 1
2013-01-02, y, 1, 2
2013-01-03, y, 0, 3
2013-01-04, y, 5, 4
2013-01-05, y, 6, 5
2013-01-06, y, 1, 6
To get the cumulative sum per user per data point, I was doing:
data.frame(cum_items_bought=unlist(tapply(as.numeric(data$items_bought), data$user, FUN = cumsum)))
The output of this looks like:
date, user, cum_items_bought
2013-01-01, x, 2
2013-01-02, x, 3
2013-01-03, x, 3
2013-01-04, x, 3
2013-01-04, x, 4
2013-01-04, x, 6
2013-01-05, x, 9
2013-01-06, x, 10
2013-01-01, y, 1
2013-01-02, y, 2
2013-01-03, y, 2
2013-01-04, y, 7
2013-01-05, y, 13
2013-01-06, y, 14
However, I want to restrict my sum to only add up the items bought within 3 days of each row (relative to the user), i.e. the output needs to look like this:
date, user, cum_items_bought_3_days
2013-01-01, x, 2
2013-01-02, x, 3
2013-01-03, x, 3
2013-01-04, x, 1
2013-01-04, x, 2
2013-01-04, x, 4
2013-01-05, x, 6
2013-01-06, x, 7
2013-01-01, y, 1
2013-01-02, y, 2
2013-01-03, y, 2
2013-01-04, y, 6
2013-01-05, y, 11
2013-01-06, y, 12
Recommended answer
Here's a dplyr solution which will produce the desired result (14 rows) as specified in the question. Note that it takes care of duplicate date entries, for example 2013-01-04 for user x.
# define a custom function to be used in the dplyr chain:
# for each event, sum items_bought over events that are not later
# and whose date is at most 2 days earlier
myfunc <- function(x){
  with(x, sapply(event_number, function(y)
    sum(items_bought[event_number <= event_number[y] & date[y] - date <= 2])))
}
library(dplyr)  # install the package first if needed
df %>%
  mutate(date = as.Date(as.character(date))) %>%
  group_by(user) %>%
  do(data.frame(., cum_items_bought_3_days = myfunc(.))) %>%
  select(-c(items_bought, event_number))
# date user cum_items_bought_3_days
#1 2013-01-01 x 2
#2 2013-01-02 x 3
#3 2013-01-03 x 3
#4 2013-01-04 x 1
#5 2013-01-04 x 2
#6 2013-01-04 x 4
#7 2013-01-05 x 6
#8 2013-01-06 x 7
#9 2013-01-01 y 1
#10 2013-01-02 y 2
#11 2013-01-03 y 2
#12 2013-01-04 y 6
#13 2013-01-05 y 11
#14 2013-01-06 y 12
In my answer I use a custom function myfunc inside a dplyr chain. This is done using the do operator from dplyr. The custom function is passed the data frame subsetted by user group. It then uses sapply to go through each event_number and, for each one, sums items_bought over the rows whose event_number is not later and whose date is at most 2 days earlier. The last line of the dplyr chain deselects the undesired columns.
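To make the filter inside myfunc concrete, here is a small base R sketch that evaluates it by hand for a single row. It copies only the rows for user x from the question's data; the names not_later, in_window and keep are illustrative and do not appear in the original answer.

```r
# Rows for user x only, copied from the question's data
x <- data.frame(
  date = as.Date(c("2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04",
                   "2013-01-04", "2013-01-04", "2013-01-05", "2013-01-06")),
  items_bought = c(2, 1, 0, 0, 1, 2, 3, 1),
  event_number = 1:8
)

y <- 6  # the event being evaluated: the third row dated 2013-01-04

# Condition 1: only events up to and including the current one
not_later <- x$event_number <= x$event_number[y]
# Condition 2: only events dated at most 2 days before the current one
in_window <- as.numeric(x$date[y] - x$date) <= 2

keep <- not_later & in_window
x$event_number[keep]        # events 2, 3, 4, 5, 6
sum(x$items_bought[keep])   # 1 + 0 + 0 + 1 + 2 = 4
```

The result, 4, matches the value for that row in the desired output. Events 7 and 8 pass the date test (their difference is negative) and are excluded only by the event_number condition.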
Let me know if you'd like a more detailed explanation.
If you need more flexibility to also conditionally sum up other columns, you can adjust the code as follows. I assume here that the other columns should be summed up the same way as items_bought. If that is not correct, please specify how you want to sum up the other columns.
I first create two additional columns with random numbers in the data (I'll post a dput of the data at the bottom of my answer):
set.seed(99) # for reproducibility only
df$newCol1 <- sample(0:10, 14, replace = TRUE)
df$newCol2 <- runif(14)
df
# date user items_bought event_number newCol1 newCol2
#1 2013-01-01 x 2 1 6 0.687800094
#2 2013-01-02 x 1 2 1 0.640190769
#3 2013-01-03 x 0 3 7 0.357885360
#4 2013-01-04 x 0 4 10 0.102584999
#5 2013-01-04 x 1 5 5 0.097790922
#6 2013-01-04 x 2 6 10 0.182886256
#7 2013-01-05 x 3 7 7 0.227903474
#8 2013-01-06 x 1 8 3 0.080524150
#9 2013-01-01 y 1 1 3 0.821618422
#10 2013-01-02 y 1 2 1 0.591113977
#11 2013-01-03 y 0 3 6 0.773389019
#12 2013-01-04 y 5 4 5 0.350085977
#13 2013-01-05 y 6 5 2 0.006061323
#14 2013-01-06 y 1 6 7 0.814506223
Next, you can modify myfunc to take 2 arguments instead of 1. The first argument remains the subsetted data.frame as before (represented by . inside the dplyr chain and by x in the function definition of myfunc), while the second argument specifies the column to sum up (colname).
myfunc <- function(x, colname){
  with(x, sapply(event_number, function(y)
    sum(x[event_number <= event_number[y] & date[y] - date <= 2, colname])))
}
Then you can use myfunc several times if you want to conditionally sum up several columns:
df %>%
  mutate(date = as.Date(as.character(date))) %>%
  group_by(user) %>%
  do(data.frame(., cum_items_bought_3_days = myfunc(., "items_bought"),
                newCol1Sums = myfunc(., "newCol1"),
                newCol2Sums = myfunc(., "newCol2"))) %>%
  select(-c(items_bought, event_number, newCol1, newCol2))
# date user cum_items_bought_3_days newCol1Sums newCol2Sums
#1 2013-01-01 x 2 6 0.6878001
#2 2013-01-02 x 3 7 1.3279909
#3 2013-01-03 x 3 14 1.6858762
#4 2013-01-04 x 1 18 1.1006611
#5 2013-01-04 x 2 23 1.1984520
#6 2013-01-04 x 4 33 1.3813383
#7 2013-01-05 x 6 39 0.9690510
#8 2013-01-06 x 7 35 0.6916898
#9 2013-01-01 y 1 3 0.8216184
#10 2013-01-02 y 2 4 1.4127324
#11 2013-01-03 y 2 10 2.1861214
#12 2013-01-04 y 6 12 1.7145890
#13 2013-01-05 y 11 13 1.1295363
#14 2013-01-06 y 12 14 1.1706535
You have now created conditional sums of the columns items_bought, newCol1 and newCol2. You can also leave out any of the sums in the dplyr chain or add more columns to sum up.
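In current dplyr releases (1.0.0 and later) do() is marked as superseded, so the same multi-column idea can also be written with mutate() and across(). The following is a sketch under that assumption, not the code from the original answer: window_sum is a hypothetical helper name, the days argument is my addition, and the data frame is rebuilt by hand from the values shown above (newCol2 is left out for brevity).

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for across()

# Sample data rebuilt from the values shown in this answer
df <- data.frame(
  date = as.Date(c("2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04",
                   "2013-01-04", "2013-01-04", "2013-01-05", "2013-01-06",
                   "2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04",
                   "2013-01-05", "2013-01-06")),
  user = rep(c("x", "y"), times = c(8, 6)),
  items_bought = c(2, 1, 0, 0, 1, 2, 3, 1, 1, 1, 0, 5, 6, 1),
  event_number = c(1:8, 1:6),
  newCol1 = c(6, 1, 7, 10, 5, 10, 7, 3, 3, 1, 6, 5, 2, 7)
)

# Hypothetical helper: for each row i, sum `value` over rows whose event
# is not later than row i and whose date is at most `days - 1` days earlier
window_sum <- function(value, date, event_number, days = 3) {
  sapply(seq_along(value), function(i) {
    keep <- event_number <= event_number[i] &
      as.numeric(date[i] - date) <= days - 1
    sum(value[keep])
  })
}

result <- df %>%
  group_by(user) %>%
  mutate(across(c(items_bought, newCol1),
                ~ window_sum(.x, date, event_number),
                .names = "{.col}_3_days")) %>%
  ungroup()

result
```

Unlike myfunc, this helper loops over row positions rather than using event_number values as indices, so it should not depend on event_number running from 1 to n within each user. Adding another column to the sum only means adding its name inside c(...).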
To calculate the cumulative count of distinct (unique) values of items_bought per user, you could define a second custom function myfunc2 and use it inside the dplyr chain. Like myfunc, this function is flexible: you can specify the column to which it is applied.
The code would be:
myfunc <- function(x, colname){
  with(x, sapply(event_number, function(y)
    sum(x[event_number <= event_number[y] & date[y] - date <= 2, colname])))
}
# contributes 1 the first time a value appears in the column, 0 on repeats
myfunc2 <- function(x, colname){
  cumsum(sapply(seq_along(x[[colname]]), function(y)
    ifelse(y != 1 & x[y, colname] %in% x[1:(y-1), colname], 0, 1)))
}
library(dplyr)  # install the package first if needed
df %>%
  mutate(date = as.Date(as.character(date))) %>%
  group_by(user) %>%
  do(data.frame(., cum_items_bought_3_days = myfunc(., "items_bought"),
                newCol1Sums = myfunc(., "newCol1"),
                newCol2Sums = myfunc(., "newCol2"),
                distinct_items_bought = myfunc2(., "items_bought"))) %>%
  select(-c(items_bought, event_number, newCol1, newCol2))
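As a cross-check on what myfunc2 computes, the same running count of distinct values can be sketched in base R with cumsum() and duplicated(). This is an alternative formulation, not part of the original answer; the vectors are copied from the question's data and the variable names are illustrative.

```r
# items_bought and user for all 14 rows, copied from the question's data
items_bought <- c(2, 1, 0, 0, 1, 2, 3, 1, 1, 1, 0, 5, 6, 1)
user <- rep(c("x", "y"), times = c(8, 6))

# duplicated() flags values already seen, so the running count of
# first occurrences is the number of distinct values so far
distinct_x <- cumsum(!duplicated(items_bought[user == "x"]))
distinct_x  # 1 2 3 3 3 3 4 4

# The same per user, using ave() to apply the function within each group
distinct_by_user <- ave(items_bought, user,
                        FUN = function(v) cumsum(!duplicated(v)))
distinct_by_user
```

For user x the values 2, 1, 0 and 3 are each counted once, on first appearance, which is the per-row behaviour myfunc2 builds with ifelse.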
Here is the data I used:
dput(df)
structure(list(date = structure(c(1L, 2L, 3L, 4L, 4L, 4L, 5L,
6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("2013-01-01", "2013-01-02",
"2013-01-03", "2013-01-04", "2013-01-05", "2013-01-06"), class = "factor"),
user = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c(" x", " y"), class = "factor"),
items_bought = c(2L, 1L, 0L, 0L, 1L, 2L, 3L, 1L, 1L, 1L,
0L, 5L, 6L, 1L), event_number = c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L), newCol1 = c(6L, 1L, 7L,
10L, 5L, 10L, 7L, 3L, 3L, 1L, 6L, 5L, 2L, 7L), newCol2 = c(0.687800094485283,
0.640190769452602, 0.357885359786451, 0.10258499882184, 0.0977909218054265,
0.182886255905032, 0.227903473889455, 0.0805241498164833,
0.821618422167376, 0.591113976901397, 0.773389018839225,
0.350085976999253, 0.00606132275424898, 0.814506222726777
)), .Names = c("date", "user", "items_bought", "event_number",
"newCol1", "newCol2"), row.names = c(NA, -14L), class = "data.frame")