groupby summarise outside of groupby dplyr

Views: 91 Published: 2020/10/26 3:29:29 Tags: r dplyr

Question

I'm trying to group ids with date in this dataset, but I want to summarise based on one of the features outside of the group.

library(dplyr)
library(lubridate)

set.seed(100)
df <- data.frame(ids = sample(c('436247', '2465347', '346654645'), 10000, replace=TRUE),
                 date = sample(seq.Date(ymd('2018-03-01'), ymd('2018-05-01'), by=1), 10000, replace=TRUE))

new_df <- df %>%
    group_by(ids, date) %>%
    summarise(events = length(ids[date >= date - 30 & date <= date]))

I'm trying to take this dataframe and answer the question - "for each of the ids, and each date, how many other records within that id, are within the past 30 days of that date". Unfortunately, when I group_by both the ids and date, it only looks within the grouped date. I've created the solution below, but not sure if there is a better one with dplyr?

groupby_function <- function(df, spec_date){
  result <- df %>%
      group_by(ids) %>%
      summarise(events = length(ids[date >= spec_date - 30 & date <= spec_date])) %>%
      mutate(date = spec_date)
  return(result)

} 

date_vector <- seq.Date(ymd('2018-03-01'), ymd('2018-05-01'), by=1)
list_results <- lapply(date_vector, groupby_function, df=df)
x <- do.call(rbind, list_results)

Recommended answer

"for each of the ids, and each date, how many other records within that id, are within the past 30 days of that date"

For that, a "join by" condition makes sense, but isn't yet included in dplyr. Until it is, you could use data.table inside your dplyr chain:

# enumerate id-date combos of interest
grid_df = expand.grid(
  id = unique(df$ids), 
  d = seq(min(df$date), max(df$date), by="day")
)

# helper function
library(data.table)
count_matches = function(DF, targetDF, ...){
  onexpr = substitute(list(...))
  data.table(targetDF)[DF, on=eval(onexpr), .N, by=.EACHI]$N
}

# use a non-equi join to count matching rows
res = grid_df %>% 
  mutate(d_dn = d - 30) %>% 
  mutate(n = count_matches(., df, ids = id, date >= d_dn, date <= d)) %>% 
  as_tibble()

# A tibble: 186 x 4
          id          d       d_dn     n
      <fctr>     <date>     <date> <int>
 1    436247 2018-03-01 2018-01-30    72
 2   2465347 2018-03-01 2018-01-30    69
 3 346654645 2018-03-01 2018-01-30    51
 4    436247 2018-03-02 2018-01-31   123
 5   2465347 2018-03-02 2018-01-31   120
 6 346654645 2018-03-02 2018-01-31   100
 7    436247 2018-03-03 2018-02-01   170
 8   2465347 2018-03-03 2018-02-01   166
 9 346654645 2018-03-03 2018-02-01   154
10    436247 2018-03-04 2018-02-02   228
# ... with 176 more rows
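
For readers on a newer dplyr, the same count can be sketched without data.table. This assumes dplyr 1.1.0 or later, where `join_by()` accepts inequality conditions; it is not part of the original answer, and the toy `df` and `grid_df` below are hypothetical stand-ins for the real data.

```r
library(dplyr)

# hypothetical toy data standing in for the question's df
df <- tibble(
  ids  = c("a", "a", "a", "b"),
  date = as.Date(c("2018-03-01", "2018-03-10", "2018-04-20", "2018-03-05"))
)

# id-date combinations of interest, with the lower bound of the 30-day window
grid_df <- tibble(
  id = c("a", "b", "b"),
  d  = as.Date(c("2018-03-15", "2018-03-15", "2018-05-01"))
) %>%
  mutate(d_dn = d - 30)

# non-equi left join: keep every grid row, attach the df rows inside its window
res <- grid_df %>%
  left_join(df, by = join_by(id == ids, d_dn <= date, d >= date)) %>%
  group_by(id, d, d_dn) %>%
  # unmatched grid rows carry an NA date, so count only the non-missing ones
  summarise(n = sum(!is.na(date)), .groups = "drop")

print(res)
```

Unlike the data.table version, this materialises every matched pair before counting, so it may use noticeably more memory on large inputs.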

It should work fine for equality conditions to write either ids = id or ids == id, I think.

If you're interested, the syntax is x[i, on=, j, by=.EACHI] where x and i are tables. For each row of i, we look up rows of x based on the on= criteria (left-hand side refers to columns in x; right-hand to columns in i); then we do j for each ("by each row of i" so by=.EACHI). In this case, j = .N means that we count matched rows of x, returned as a column of counts N.
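
To make the x[i, on=, j, by=.EACHI] pattern concrete, here is a minimal sketch on a toy table. The data and the names `x`, `i` and `counts` are made up for illustration; only the join syntax comes from the answer above.

```r
library(data.table)

# x: the table being searched (hypothetical toy records)
x <- data.table(
  ids  = c("a", "a", "a", "b"),
  date = as.Date(c("2018-03-01", "2018-03-10", "2018-04-20", "2018-03-05"))
)

# i: one row per lookup, carrying the window bounds as columns
i <- data.table(
  id = c("a", "b"),
  d  = as.Date(c("2018-03-15", "2018-03-15"))
)
i[, d_dn := d - 30]

# for each row of i, count the rows of x whose id matches and whose date
# falls in [d_dn, d]; left-hand sides are columns of x, right-hand sides of i
counts <- x[i, on = .(ids = id, date >= d_dn, date <= d), .N, by = .EACHI]$N

print(counts)
```

Here id "a" has two records inside the window ending 2018-03-15 and id "b" has one, so `counts` holds one count per row of `i`, in the same order.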
