groupby summarise outside of groupby dplyr


Question

I'm trying to group this dataset by ids and date, but I want to summarise based on values that lie outside of each group.

library(dplyr)
library(lubridate)

set.seed(100)
df <- data.frame(ids = sample(c('436247', '2465347', '346654645'), 10000, replace=TRUE),
                 date = sample(seq.Date(ymd('2018-03-01'), ymd('2018-05-01'), by=1), 10000, replace=TRUE))

new_df <- df %>%
    group_by(ids, date) %>%
    summarise(events = length(ids[date >= date - 30 & date <= date]))

I'm trying to take this data frame and answer the question: "for each of the ids, and each date, how many other records within that id are within the past 30 days of that date?" Unfortunately, when I group_by both ids and date, the summary only looks within the grouped date. I've created the solution below, but I'm not sure whether there is a better one with dplyr.

groupby_function <- function(df, spec_date){
  result <- df %>%
      group_by(ids) %>%
      summarise(events = length(ids[date >= spec_date - 30 & date <= spec_date])) %>%
      mutate(date = spec_date)
  return(result)
}

date_vector <- seq.Date(ymd('2018-03-01'), ymd('2018-05-01'), by=1)
list_results <- lapply(date_vector, groupby_function, df=df)
x <- do.call(rbind, list_results)


Recommended answer


"for each of the ids, and each date, how many other records within that id, are within the past 30 days of that date"

For that, a "join by" condition makes sense, but isn't yet included in dplyr. Until it is, you could use data.table inside your dplyr chain:

# enumerate id-date combos of interest
grid_df = expand.grid(
  id = unique(df$ids), 
  d = seq(min(df$date), max(df$date), by="day")
)

# helper function
library(data.table)
count_matches = function(DF, targetDF, ...){
  onexpr = substitute(list(...))
  data.table(targetDF)[DF, on=eval(onexpr), .N, by=.EACHI]$N
}

# use a non-equi join to count matching rows
res = grid_df %>% 
  mutate(d_dn = d - 30) %>% 
  mutate(n = count_matches(., df, ids = id, date >= d_dn, date <= d)) %>% 
  as_tibble()

# A tibble: 186 x 4
          id          d       d_dn     n
      <fctr>     <date>     <date> <int>
 1    436247 2018-03-01 2018-01-30    72
 2   2465347 2018-03-01 2018-01-30    69
 3 346654645 2018-03-01 2018-01-30    51
 4    436247 2018-03-02 2018-01-31   123
 5   2465347 2018-03-02 2018-01-31   120
 6 346654645 2018-03-02 2018-01-31   100
 7    436247 2018-03-03 2018-02-01   170
 8   2465347 2018-03-03 2018-02-01   166
 9 346654645 2018-03-03 2018-02-01   154
10    436247 2018-03-04 2018-02-02   228
# ... with 176 more rows

In the count_matches() call, it should work fine for equality conditions to write either ids = id or ids == id, I think.
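For readers on a newer dplyr: since this answer was written, dplyr 1.1.0 added join_by(), which accepts inequality conditions, so the same kind of count can be sketched without data.table. The block below is only an illustration under assumptions: the `toy` and `grid` tables are made-up data (not the 10,000-row df from the question), and the window is taken as inclusive on both ends. Note that join_by() requires `==` for equality conditions.

```r
library(dplyr)  # join_by() assumes dplyr >= 1.1.0

# Made-up records, standing in for df from the question
toy <- tibble(
  ids  = c("a", "a", "a", "b"),
  date = as.Date(c("2018-03-01", "2018-03-20", "2018-04-25", "2018-03-05"))
)

# Every id-date combination of interest, plus the start of its 30-day window
grid <- expand.grid(
  id = unique(toy$ids),
  d  = seq(min(toy$date), max(toy$date), by = "day"),
  stringsAsFactors = FALSE
) %>%
  mutate(d_dn = d - 30)

# Non-equi join: left-hand sides are columns of grid, right-hand sides of toy.
# Combinations with no matching record keep date = NA, so they count as 0.
res <- grid %>%
  left_join(toy, by = join_by(id == ids, d_dn <= date, d >= date)) %>%
  group_by(id, d) %>%
  summarise(n = sum(!is.na(date)), .groups = "drop")
```

The left join keeps every id-date combination, which is why the count uses sum(!is.na(date)) rather than n(): an unmatched combination still contributes one row to its group.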

If you're interested, the data.table syntax is x[i, on=, j, by=.EACHI], where x and i are tables. For each row of i, we look up rows of x based on the on= criteria (the left-hand side refers to columns in x; the right-hand side to columns in i); then we do j for each ("by each row of i", hence by=.EACHI). In this case, j = .N means that we count the matched rows of x, returned as a column of counts N.
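To see the x[i, on=, j, by=.EACHI] pattern in isolation, here is a small sketch on made-up tables (the values and the names `x`, `i` and `counts` are purely illustrative, not taken from the question's data). Each row of i defines one id and one 30-day window, and .N counts the rows of x that fall inside it.

```r
library(data.table)

# x: the records to be counted (made-up values)
x <- data.table(
  ids  = c("a", "a", "a", "b"),
  date = as.Date(c("2018-03-01", "2018-03-20", "2018-04-25", "2018-03-05"))
)

# i: one row per id-date window to look up
i <- data.table(
  id = c("a", "a", "b", "b"),
  d  = as.Date(c("2018-03-20", "2018-04-25", "2018-03-01", "2018-04-04"))
)
i[, d_dn := d - 30]  # start of the 30-day window

# For each row of i, count the rows of x with the same id and a date
# inside [d_dn, d]; on= pairs x columns (left) with i columns (right)
counts <- x[i, on = .(ids = id, date >= d_dn, date <= d), .N, by = .EACHI]

# by = .EACHI returns one result row per row of i, in the same order
i[, n := counts$N]
print(i)
```

Because the result is aligned with the rows of i, the counts can be attached back as a plain column, which is what the count_matches() helper above relies on when it is called inside mutate().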
