How to correctly use group_by() and summarise() in a for loop in R
Question
I'm trying to calculate some summary information to help me check for outliers in different groups in a dataset. Using dplyr::group_by() and dplyr::summarise(), I can get the sort of output I want: a dataframe with summary information for each group for a given variable. Something like this:
library(dplyr)  # provides %>%, group_by() and summarise()

Sepal.Length_outlier_check <- iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarise(
    min = min(Sepal.Length, na.rm = TRUE),
    max = max(Sepal.Length, na.rm = TRUE),
    median = median(Sepal.Length, na.rm = TRUE),
    MAD = mad(Sepal.Length, na.rm = TRUE),
    # Flag values more than 3 MADs away from the group median
    MAD_lowlim = median - (3 * MAD),
    MAD_highlim = median + (3 * MAD),
    Outliers_low = any(Sepal.Length < MAD_lowlim, na.rm = TRUE),
    Outliers_high = any(Sepal.Length > MAD_highlim, na.rm = TRUE)
  )

Sepal.Length_outlier_check
However, I'd like to put this in a for loop so that I can produce a similar summary dataframe for each of the variables in the dataset. I'm new to using loops, but I was thinking it might need to look something like this:
vars <- list(colnames(iris))
for (i in vars) {
  x <- iris %>%
    dplyr::group_by(Species) %>%
    dplyr::summarise(
      min = min(i, na.rm = TRUE),
      max = max(i, na.rm = TRUE),
      median = median(i, na.rm = TRUE),
      MAD = mad(i, na.rm = TRUE),
      MAD_lowlim = median - (3 * MAD),
      MAD_highlim = median + (3 * MAD),
      Outliers_low = any(i < MAD_lowlim, na.rm = TRUE),
      Outliers_high = any(i > MAD_highlim, na.rm = TRUE)
    )
  assign(paste(i, "Outlier_check", sep = "_"), x)
}
I know that doesn't work, though, because inside the summary functions i isn't actually referencing any data. I'm not sure what I need to do to make it work! I'd be very grateful for your help, or for any suggestions on how to accomplish all of this more elegantly.
I'm reluctant to use dplyr::summarise_all() because it outputs one summary table for all the variables. The real dataset I'm working on has many variables, so that table would become too large to review easily.
Thanks.
Recommended answer
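If you do want to keep the loop, here is a minimal sketch of one way to make it work. It is an illustration under stated assumptions, not necessarily how the original answerers wrote it. Two things change relative to the code in the question. First, `list(colnames(iris))` builds a one-element list, so the loop body runs only once; iterating over the character vector of column names (minus the grouping column) visits each variable. Second, inside `summarise()` the string `i` has to be turned into a column reference, which dplyr's `.data[[i]]` pronoun does. The name `outlier_checks` is made up for this example, and results are collected in a named list rather than created with `assign()`.

```r
library(dplyr)

# Column names to summarise; Species is the grouping variable, so drop it.
vars <- setdiff(colnames(iris), "Species")

# Hypothetical container: one summary data frame per variable.
outlier_checks <- list()

for (i in vars) {
  outlier_checks[[i]] <- iris %>%
    group_by(Species) %>%
    summarise(
      # .data[[i]] looks up the column whose name is stored in the string i
      min = min(.data[[i]], na.rm = TRUE),
      max = max(.data[[i]], na.rm = TRUE),
      median = median(.data[[i]], na.rm = TRUE),
      MAD = mad(.data[[i]], na.rm = TRUE),
      MAD_lowlim = median - (3 * MAD),
      MAD_highlim = median + (3 * MAD),
      Outliers_low = any(.data[[i]] < MAD_lowlim, na.rm = TRUE),
      Outliers_high = any(.data[[i]] > MAD_highlim, na.rm = TRUE)
    )
}

# Inspect a single variable's summary
outlier_checks[["Sepal.Length"]]
```

Keeping the summaries in a list means each one can be reviewed on its own with `outlier_checks[["<column name>"]]`, without filling the workspace with one object per variable.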
You could also create these per-variable/species summaries without loops or separate functions, simply by gathering the non-Species columns, grouping, and summarizing:
library(tidyverse)

iris.summary <- iris %>%
  # Reshape to long format: one row per (Species, variable, value)
  gather(variable, value, -Species) %>%
  group_by(variable, Species) %>%
  summarize(
    min = min(value, na.rm = TRUE),
    max = max(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    MAD = mad(value, na.rm = TRUE),
    MAD_lowlim = median - (3 * MAD),
    MAD_highlim = median + (3 * MAD),
    Outliers_low = any(value < MAD_lowlim, na.rm = TRUE),
    Outliers_high = any(value > MAD_highlim, na.rm = TRUE)
  )
   variable     Species      min   max median   MAD MAD_lowlim MAD_highlim Outliers_low Outliers_high
   <chr>        <fct>      <dbl> <dbl>  <dbl> <dbl>      <dbl>       <dbl> <lgl>        <lgl>
 1 Petal.Length setosa         1   1.9    1.5 0.148       1.06        1.94 TRUE         FALSE
 2 Petal.Length versicolor     3   5.1   4.35 0.519       2.79        5.91 FALSE        FALSE
 3 Petal.Length virginica    4.5   6.9   5.55 0.667       3.55        7.55 FALSE        FALSE
 4 Petal.Width  setosa       0.1   0.6    0.2     0        0.2         0.2 TRUE         TRUE
 5 Petal.Width  versicolor     1   1.8    1.3 0.222      0.633        1.97 FALSE        FALSE
 6 Petal.Width  virginica    1.4   2.5      2 0.297       1.11        2.89 FALSE        FALSE
 7 Sepal.Length setosa       4.3   5.8      5 0.297       4.11        5.89 FALSE        FALSE
 8 Sepal.Length versicolor   4.9     7    5.9 0.519       4.34        7.46 FALSE        FALSE
 9 Sepal.Length virginica    4.9   7.9    6.5 0.593       4.72        8.28 FALSE        FALSE
10 Sepal.Width  setosa       2.3   4.4    3.4 0.371       2.29        4.51 FALSE        FALSE
11 Sepal.Width  versicolor     2   3.4    2.8 0.297       1.91        3.69 FALSE        FALSE
12 Sepal.Width  virginica    2.2   3.8      3 0.297       2.11        3.89 FALSE        FALSE
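On recent tidyr versions `gather()` is superseded by `pivot_longer()`, and the concern in the question about one oversized table can be handled by splitting the long summary into one table per variable afterwards. The following is a sketch of that variation, assuming dplyr 1.0.0 or later (for the `.groups` argument) and tidyr 1.0.0 or later; `iris_summary` and `per_variable` are names chosen for this example.

```r
library(dplyr)
library(tidyr)

iris_summary <- iris %>%
  # Long format: one row per (Species, variable, value)
  pivot_longer(-Species, names_to = "variable", values_to = "value") %>%
  group_by(variable, Species) %>%
  summarise(
    min = min(value, na.rm = TRUE),
    max = max(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    MAD = mad(value, na.rm = TRUE),
    MAD_lowlim = median - (3 * MAD),
    MAD_highlim = median + (3 * MAD),
    Outliers_low = any(value < MAD_lowlim, na.rm = TRUE),
    Outliers_high = any(value > MAD_highlim, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped tibble
  )

# One small table per variable, as a named list
per_variable <- split(iris_summary, iris_summary$variable)

per_variable[["Sepal.Width"]]
```

Each element of `per_variable` holds the three Species rows for one variable, so a dataset with many variables can be reviewed one table at a time.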