How to correctly use group_by() and summarise() in a For loop in R

Question

I'm trying to calculate some summary information to help me check for outliers in different groups in a dataset. I can get the sort of output I want using dplyr::group_by() and dplyr::summarise() - a dataframe with summary information for each group for a given variable. Something like this:

Sepal.Length_outlier_check <- iris %>%
  dplyr::group_by(Species) %>% 
  dplyr::summarise(min = min(Sepal.Length, na.rm = TRUE),
                   max = max(Sepal.Length, na.rm = TRUE),
                   median = median(Sepal.Length, na.rm = TRUE),
                   MAD = mad(Sepal.Length, na.rm = TRUE),
                   MAD_lowlim = median - (3 * MAD),
                   MAD_highlim = median + (3 * MAD),
                   Outliers_low = any(Sepal.Length < MAD_lowlim, na.rm = TRUE),
                   Outliers_high = any(Sepal.Length > MAD_highlim, na.rm = TRUE)
                   )

Sepal.Length_outlier_check

However, I'd like to be able to put this in a For loop to be able to produce similar summary dataframes for each of the different variables in the dataset. I'm new to using loops, but I was thinking it might need to look something like this:

vars <- list(colnames(iris))

for (i in vars) {

x <- iris %>%
  dplyr::group_by(Species) %>% 
  dplyr::summarise(min = min(i, na.rm = TRUE),
                   max = max(i, na.rm = TRUE),
                   median = median(i, na.rm = TRUE),
                   MAD = mad(i, na.rm = TRUE),
                   MAD_lowlim = median - (3 * MAD),
                   MAD_highlim = median + (3 * MAD),
                   Outliers_low = any(i < MAD_lowlim, na.rm = TRUE),
                   Outliers_high = any(i > MAD_highlim, na.rm = TRUE)
                   )

assign(paste(i, "Outlier_check", sep = "_"), x)

}

I know that doesn't work though because in the summary functions i isn't actually referencing any data. I'm not sure what I need to do to make it work though! I'd be very grateful for your help, or any suggestions for how to accomplish all of this more elegantly.

I'm reluctant to use dplyr::summarise_all() because it outputs one summary table for all the variables, and as the real dataset I'm working on has many variables this summary table would become too large to be able to easily review it.

Thanks.

Recommended answer

You could also create these per-variable/species summaries without loops or separate functions, simply by gathering the non-Species columns, grouping, and summarizing:

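library(tidyverse)

# Reshape to long format: one row per Species/variable/value.
# gather() still works but is superseded by tidyr::pivot_longer().
iris.summary <- iris %>% 
  gather(variable, value, -Species) %>% 
  # One group per variable/species combination
  group_by(variable, Species) %>% 
  summarize(
    min = min(value, na.rm = TRUE),
    max = max(value, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    MAD = mad(value, na.rm = TRUE),
    # Outlier limits at the median plus or minus 3 MADs
    MAD_lowlim = median - (3 * MAD),
    MAD_highlim = median + (3 * MAD),
    Outliers_low = any(value < MAD_lowlim, na.rm = TRUE),
    Outliers_high = any(value > MAD_highlim, na.rm = TRUE)
  )

iris.summary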

   variable     Species      min   max median   MAD MAD_lowlim MAD_highlim Outliers_low Outliers_high
   <chr>        <fct>      <dbl> <dbl>  <dbl> <dbl>      <dbl>       <dbl> <lgl>        <lgl>        
 1 Petal.Length setosa       1     1.9   1.5  0.148      1.06         1.94 TRUE         FALSE        
 2 Petal.Length versicolor   3     5.1   4.35 0.519      2.79         5.91 FALSE        FALSE        
 3 Petal.Length virginica    4.5   6.9   5.55 0.667      3.55         7.55 FALSE        FALSE        
 4 Petal.Width  setosa       0.1   0.6   0.2  0          0.2          0.2  TRUE         TRUE         
 5 Petal.Width  versicolor   1     1.8   1.3  0.222      0.633        1.97 FALSE        FALSE        
 6 Petal.Width  virginica    1.4   2.5   2    0.297      1.11         2.89 FALSE        FALSE        
 7 Sepal.Length setosa       4.3   5.8   5    0.297      4.11         5.89 FALSE        FALSE        
 8 Sepal.Length versicolor   4.9   7     5.9  0.519      4.34         7.46 FALSE        FALSE        
 9 Sepal.Length virginica    4.9   7.9   6.5  0.593      4.72         8.28 FALSE        FALSE        
10 Sepal.Width  setosa       2.3   4.4   3.4  0.371      2.29         4.51 FALSE        FALSE        
11 Sepal.Width  versicolor   2     3.4   2.8  0.297      1.91         3.69 FALSE        FALSE        
12 Sepal.Width  virginica    2.2   3.8   3    0.297      2.11         3.89 FALSE        FALSE   
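
If you would still rather keep the loop from the question, the missing piece is that `i` holds a column name as a string, not the column itself. Below is a minimal sketch of one way to do it, assuming a dplyr version that provides the `.data` pronoun. The names `v` and `outlier_checks` are made up for illustration, and the results go into a named list rather than separate objects created with assign(). Note also that `list(colnames(iris))` in the question produces a list of length one, so that loop would run only once; a plain character vector is used here instead.

```r
library(dplyr)

# Columns to check. Species is the grouping column, so it is excluded.
vars <- setdiff(colnames(iris), "Species")

# One summary data frame per variable, stored in a named list
# (an illustrative alternative to assign()).
outlier_checks <- list()

for (v in vars) {
  outlier_checks[[v]] <- iris %>%
    group_by(Species) %>%
    summarise(
      # .data[[v]] looks up the column whose name is stored in v
      min = min(.data[[v]], na.rm = TRUE),
      max = max(.data[[v]], na.rm = TRUE),
      median = median(.data[[v]], na.rm = TRUE),
      MAD = mad(.data[[v]], na.rm = TRUE),
      MAD_lowlim = median - (3 * MAD),
      MAD_highlim = median + (3 * MAD),
      Outliers_low = any(.data[[v]] < MAD_lowlim, na.rm = TRUE),
      Outliers_high = any(.data[[v]] > MAD_highlim, na.rm = TRUE)
    )
}

# Review one variable at a time
outlier_checks[["Sepal.Length"]]
```

Each list element is a small three-row table for a single variable, which should be easier to review than one wide table when the real dataset has many columns.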
