使用dplyr高效汇总具有不同功能的不同列 [英] Efficient summarise of different columns with different functions with dplyr

查看:44
本文介绍了使用dplyr高效汇总具有不同功能的不同列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

下面的代码正是我想要的.获取具有多个分组约束(即id和date)的数据框,并为每个数据返回1行,并将不同的功能应用于不同的列.在这种情况下,请折叠成字符并通过分组来占用列的最大值.

The code below does exactly what I want it to do. Takes a data frame with multiple grouping constraints (ie. id and date) and returns 1 row for each with different functions applied to different columns. In this case, collapsing to character and taking the maximum of the column by grouping.

问题在于将这种方法扩展到具有10万行和组的更大数据帧.我怎样才能更有效地得出相同的结论?如果在整排结构中可行,则这是首选的(虽然不是首选,但data.table解决方案也将被接受).我想保持灵活性,以指定要包含在函数调用中的不同列,并添加其他摘要调用(例如summarise_at(值的总和)).

The issue is scaling this approach to much bigger data frames with 100k rows and groups. How can I reach the same conclusion in a more efficient manner? If this is possible in a tidyverse structure this is preferred (while not preferred, data.table solutions will also be accepted). I want to keep the flexibility to specify the different cols to be included in the function calls, as well as add additional summary calls (e.g. summarise_at(sum of values)).

data <- tibble(id = c(1,1,1,2,3,4,5,5,6,6,6), date = dmy("01/01/2020"), var1 =  1:11, var2 = 12:22, var3 = 1:11)

data %>% 
group_by(id, date) %>%
    {data.frame(
            summarise_at(., vars(var1, var2), list(~ paste(unique(.), collapse = " AND "))), # return character string
            summarise_at(., vars(var3), list(~ max(., na.rm = T))),#, # return max in group
            summarise(., count = n(), .groups = "keep") # return count of cases in group
        )} %>% 
    select(-matches("[.]1$|[.]2$|[.]3$")) %>% # remove unwanted columns
    as_tibble()

# A tibble: 6 x 6
     id date       var1            var2              var3 count
  <dbl> <date>     <chr>           <chr>            <int> <int>
1     1 2020-01-01 1 AND 2 AND 3   12 AND 13 AND 14     3     3
2     2 2020-01-01 4               15                   4     1
3     3 2020-01-01 5               16                   5     1
4     4 2020-01-01 6               17                   6     1
5     5 2020-01-01 7 AND 8         18 AND 19            8     2
6     6 2020-01-01 9 AND 10 AND 11 20 AND 21 AND 22    11     3

也链接到此问题.如何使用summarise_at来将不同的功能应用于不同的列?

推荐答案

这是 tidyverse 的方法,改编自 @MichaelDewar的答案,虽然比较整齐,但我不认为效率有任何真正的提高.此外,在我看来,具有10万行的数据帧没什么大不了的.我认为 tidyverse 解决方案就可以了.

Here is the tidyverse approach adapted from @MichaelDewar's answer, it's tidier but I don't think there is any real improvement in the efficiency. Besides, a dataframe with 100k rows is not that big a deal in my opinion. I think a tidyverse solution is just fine.

library(dplyr)

data %>% 
  group_by(id, date) %>% 
  summarise(
    across(c(var1, var2), ~paste(unique(.), collapse = " AND ")), 
    across(var3, max, na.rm = T), 
    count = n(), .groups = "keep"
  )

但是,如果您真的想提高效率,请尝试使用此data.table解决方案

But if you really want to improve the efficiency, perhaps try this data.table solution

library(data.table)

setDT(data)[, c(
  lapply(c(var1 = "var1", var2 = "var2"), function(x) paste(unique(.SD[[x]]), collapse = " AND ")), 
  list(var3 = max(var3, na.rm = T), count = .N)
), by = c("id", "date")]

基准

set.seed(2020)
data2 <- data[sample.int(nrow(data), 1e5, T), ]
data22 <- data.table::copy(data2)

f1 <- 
  . %>% 
  group_by(id, date) %>%
  {data.frame(
    summarise_at(., vars(var1, var2), list(~ paste(unique(.), collapse = " AND "))), # return character string
    summarise_at(., vars(var3), list(~ max(., na.rm = T))),#, # return max in group
    summarise(., count = n(), .groups = "keep") # return count of cases in group
  )} %>% 
  select(-matches("[.]1$|[.]2$|[.]3$")) %>% # remove unwanted columns
  as_tibble()

f2 <- 
  . %>% 
  group_by(id, date) %>% 
  summarise(
    across(c(var1, var2), ~paste(unique(.), collapse = " AND ")), 
    across(var3, max, na.rm = T), 
    count = n(), 
    .groups = "keep"
  )

f3 <- function(dt) {
  setDT(dt)[, c(
    lapply(c(var1 = "var1", var2 = "var2"), function(x) paste(unique(.SD[[x]]), collapse = " AND ")), 
    list(var3 = max(var3, na.rm = T), count = .N)
  ), by = c("id", "date")]
}

microbenchmark::microbenchmark(f1(data2), f2(data2), f3(data22))

结果

Unit: milliseconds
       expr     min       lq      mean   median       uq     max neval cld
  f1(data2) 19.6730 20.27990 20.841344 20.50850 20.85045 29.2799   100   c
  f2(data2) 13.5455 14.09240 14.705967 14.34585 14.64625 20.5914   100  b 
 f3(data22)  6.9186  7.80615  8.598227  8.32035  8.68040 15.8358   100 a  

这篇关于使用dplyr高效汇总具有不同功能的不同列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆