使用dplyr高效汇总具有不同功能的不同列 [英] Efficient summarise of different columns with different functions with dplyr
问题描述
下面的代码正是我想要的.获取具有多个分组约束(即id和date)的数据框,并为每个数据返回1行,并将不同的功能应用于不同的列.在这种情况下,请折叠成字符并通过分组来占用列的最大值.
The code below does exactly what I want it to do. Takes a data frame with multiple grouping constraints (ie. id and date) and returns 1 row for each with different functions applied to different columns. In this case, collapsing to character and taking the maximum of the column by grouping.
问题在于将这种方法扩展到具有10万行和组的更大数据帧.我怎样才能更有效地得出相同的结论?如果在整排结构中可行,则这是首选的(虽然不是首选,但data.table解决方案也将被接受).我想保持灵活性,以指定要包含在函数调用中的不同列,并添加其他摘要调用(例如summarise_at(值的总和)).
The issue is scaling this approach to much bigger data frames with 100k rows and groups. How can I reach the same conclusion in a more efficient manner? If this is possible in a tidyverse structure this is preferred (while not preferred, data.table solutions will also be accepted). I want to keep the flexibility to specify the different cols to be included in the function calls, as well as add additional summary calls (e.g. summarise_at(sum of values)).
data <- tibble(id = c(1,1,1,2,3,4,5,5,6,6,6), date = dmy("01/01/2020"), var1 = 1:11, var2 = 12:22, var3 = 1:11)
data %>%
group_by(id, date) %>%
{data.frame(
summarise_at(., vars(var1, var2), list(~ paste(unique(.), collapse = " AND "))), # return character string
summarise_at(., vars(var3), list(~ max(., na.rm = T))),#, # return max in group
summarise(., count = n(), .groups = "keep") # return count of cases in group
)} %>%
select(-matches("[.]1$|[.]2$|[.]3$")) %>% # remove unwanted columns
as_tibble()
# A tibble: 6 x 6
id date var1 var2 var3 count
<dbl> <date> <chr> <chr> <int> <int>
1 1 2020-01-01 1 AND 2 AND 3 12 AND 13 AND 14 3 3
2 2 2020-01-01 4 15 4 1
3 3 2020-01-01 5 16 5 1
4 4 2020-01-01 6 17 6 1
5 5 2020-01-01 7 AND 8 18 AND 19 8 2
6 6 2020-01-01 9 AND 10 AND 11 20 AND 21 AND 22 11 3
也链接到此问题.如何使用summarise_at来将不同的功能应用于不同的列?
推荐答案
这是 tidyverse
的方法,改编自 @MichaelDewar的答案,虽然比较整齐,但我不认为效率有任何真正的提高.此外,在我看来,具有10万行的数据帧没什么大不了的.我认为 tidyverse
解决方案就可以了.
Here is the tidyverse
approach adapted from @MichaelDewar's answer, it's tidier but I don't think there is any real improvement in the efficiency. Besides, a dataframe with 100k rows is not that big a deal in my opinion. I think a tidyverse
solution is just fine.
library(dplyr)
data %>%
group_by(id, date) %>%
summarise(
across(c(var1, var2), ~paste(unique(.), collapse = " AND ")),
across(var3, max, na.rm = T),
count = n(), .groups = "keep"
)
但是,如果您真的想提高效率,请尝试使用此data.table解决方案
But if you really want to improve the efficiency, perhaps try this data.table solution
library(data.table)
setDT(data)[, c(
lapply(c(var1 = "var1", var2 = "var2"), function(x) paste(unique(.SD[[x]]), collapse = " AND ")),
list(var3 = max(var3, na.rm = T), count = .N)
), by = c("id", "date")]
基准
set.seed(2020)
data2 <- data[sample.int(nrow(data), 1e5, T), ]
data22 <- data.table::copy(data2)
f1 <-
. %>%
group_by(id, date) %>%
{data.frame(
summarise_at(., vars(var1, var2), list(~ paste(unique(.), collapse = " AND "))), # return character string
summarise_at(., vars(var3), list(~ max(., na.rm = T))),#, # return max in group
summarise(., count = n(), .groups = "keep") # return count of cases in group
)} %>%
select(-matches("[.]1$|[.]2$|[.]3$")) %>% # remove unwanted columns
as_tibble()
f2 <-
. %>%
group_by(id, date) %>%
summarise(
across(c(var1, var2), ~paste(unique(.), collapse = " AND ")),
across(var3, max, na.rm = T),
count = n(),
.groups = "keep"
)
f3 <- function(dt) {
setDT(dt)[, c(
lapply(c(var1 = "var1", var2 = "var2"), function(x) paste(unique(.SD[[x]]), collapse = " AND ")),
list(var3 = max(var3, na.rm = T), count = .N)
), by = c("id", "date")]
}
microbenchmark::microbenchmark(f1(data2), f2(data2), f3(data22))
结果
Unit: milliseconds
expr min lq mean median uq max neval cld
f1(data2) 19.6730 20.27990 20.841344 20.50850 20.85045 29.2799 100 c
f2(data2) 13.5455 14.09240 14.705967 14.34585 14.64625 20.5914 100 b
f3(data22) 6.9186 7.80615 8.598227 8.32035 8.68040 15.8358 100 a
这篇关于使用dplyr高效汇总具有不同功能的不同列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!