groupby summarise outside of groupby dplyr
Question
I'm trying to group ids by date in this dataset, but I want to summarise based on one of the features outside of the group.
library(dplyr)
library(lubridate)
set.seed(100)
df <- data.frame(ids = sample(c('436247', '2465347', '346654645'), 10000, replace=TRUE),
                 date = sample(seq.Date(ymd('2018-03-01'), ymd('2018-05-01'), by=1), 10000, replace=TRUE))
new_df <- df %>%
  group_by(ids, date) %>%
  summarise(events = length(ids[date >= date - 30 & date <= date]))
I'm trying to take this data frame and answer the question: "for each of the ids, and each date, how many other records within that id are within the past 30 days of that date?" Unfortunately, when I group_by both the ids and date, it only looks within the grouped date. I've created the solution below, but I'm not sure whether there is a better one with dplyr.
groupby_function <- function(df, spec_date){
  result <- df %>%
    group_by(ids) %>%
    summarise(events = length(ids[date >= spec_date - 30 & date <= spec_date])) %>%
    mutate(date = spec_date)
  return(result)
}
date_vector <- seq.Date(ymd('2018-03-01'), ymd('2018-05-01'), by=1)
list_results <- lapply(date_vector, groupby_function, df=df)
x <- do.call(rbind, list_results)
Recommended answer
"for each of the ids, and each date, how many other records within that id, are within the past 30 days of that date"
For that, a "join by" condition makes sense, but isn't yet included in dplyr. Until it is, you could use data.table inside your dplyr chain:
# enumerate id-date combos of interest
grid_df = expand.grid(
  id = unique(df$ids),
  d = seq(min(df$date), max(df$date), by="day")
)
# helper function
library(data.table)
count_matches = function(DF, targetDF, ...){
  onexpr = substitute(list(...))
  data.table(targetDF)[DF, on=eval(onexpr), .N, by=.EACHI]$N
}
# use a non-equi join to count matching rows
res = grid_df %>%
  mutate(d_dn = d - 30) %>%
  mutate(n = count_matches(., df, ids = id, date >= d_dn, date <= d)) %>%
  as_tibble()
# A tibble: 186 x 4
id d d_dn n
<fctr> <date> <date> <int>
1 436247 2018-03-01 2018-01-30 72
2 2465347 2018-03-01 2018-01-30 69
3 346654645 2018-03-01 2018-01-30 51
4 436247 2018-03-02 2018-01-31 123
5 2465347 2018-03-02 2018-01-31 120
6 346654645 2018-03-02 2018-01-31 100
7 436247 2018-03-03 2018-02-01 170
8 2465347 2018-03-03 2018-02-01 166
9 346654645 2018-03-03 2018-02-01 154
10 436247 2018-03-04 2018-02-02 228
# ... with 176 more rows
For equality conditions, writing either ids = id or ids == id should work fine, I think.
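For comparison, newer dplyr releases (1.1.0 and later, to my knowledge) have since added join_by(), which accepts inequality conditions. The sketch below assumes such a version is installed. The toy data frames are hypothetical and only mirror the column names used above (ids, date, id, d, d_dn); this is one possible way to express the same count, not the method used in this answer.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, which provides join_by()

# hypothetical toy data with the same column names as above
df <- data.frame(
  ids  = c("a", "a", "a", "b"),
  date = as.Date(c("2018-03-01", "2018-03-20", "2018-04-25", "2018-03-05"))
)

# id-date combos of interest
grid_df <- expand.grid(
  id = unique(df$ids),
  d  = as.Date(c("2018-03-31", "2018-04-30")),
  stringsAsFactors = FALSE
)

res <- grid_df %>%
  mutate(d_dn = d - 30) %>%
  # in join_by(), the left-hand side names columns of grid_df, the right-hand side columns of df
  left_join(df, by = join_by(id == ids, d_dn <= date, d >= date)) %>%
  group_by(id, d) %>%
  # unmatched combos keep a single row with date = NA, so count non-missing dates
  summarise(n = sum(!is.na(date)), .groups = "drop")

print(res)
```

Note that the operand order is the reverse of data.table's on=: here the left-hand side of each condition refers to the table on the left of the join.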
If you're interested, the data.table syntax is x[i, on=, j, by=.EACHI], where x and i are tables. For each row of i, we look up rows of x based on the on= criteria (the left-hand side refers to columns in x; the right-hand side to columns in i); then we do j for each ("by each row of i", hence by=.EACHI). In this case, j = .N means that we count the matched rows of x, returned as a column of counts N.
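To make the x[i, on=, j, by=.EACHI] pattern concrete, here is a minimal sketch on hypothetical toy tables (the data values are made up for illustration; the column names follow the answer above). It calls data.table directly instead of going through the count_matches helper.

```r
library(data.table)

# x: the table whose rows get counted (hypothetical toy data)
x <- data.table(
  ids  = c("a", "a", "a", "b"),
  date = as.Date(c("2018-03-01", "2018-03-20", "2018-04-25", "2018-03-05"))
)

# i: one row per lookup, with the 30-day window bounds as columns
i <- data.table(
  id = c("a", "a", "b"),
  d  = as.Date(c("2018-03-31", "2018-04-30", "2018-04-30"))
)
i[, d_dn := d - 30]

# for each row of i, count the rows of x with the same id and a date inside [d_dn, d]
counts <- x[i, on = .(ids = id, date >= d_dn, date <= d), .N, by = .EACHI]
print(counts$N)
```

The result has one row per row of i, in the same order, which is why the helper above can return the N column directly inside mutate().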