Using conditions in group_by()/summarize() loop


Question

I have a dataframe that looks something like this (I have a lot more years and variables):

Name    State2014     State2015  State2016  Tuition2014   Tuition2015  Tuition2016  StateGrants2014
Jared   CA            CA         MA         22430         23060        40650        5000
Beth    CA            CA         CA         36400         37050        37180        4200
Steven  MA            MA         MA         18010         18250        18720        NA
Lary    MA            CA         MA         24080         30800        24600        6600
Tom     MA            OR         OR         40450         15800        16040        NA
Alfred  OR            OR         OR         23570         23680        23750        3500
Cathy   OR            OR         OR         32070         32070        33040        4700

My objective (in this example) is to get the mean tuition for each state, and the sum of state grants for each state. My thought was to subset the data by year:

State2014     Tuition2014   StateGrants2014
CA            22430         5000
CA            36400         4200
MA            18010         NA
MA            24080         6600
MA            40450         NA
OR            23570         3500
OR            32070         4700

State2015  Tuition2015  
CA         23060        
CA         37050        
MA         18250        
CA         30800        
OR         15800        
OR         23680        
OR         32070       

State2016  Tuition2016  
MA         40650        
CA         37180        
MA         18720        
MA         24600        
OR         16040        
OR         23750        
OR         33040 

Then I would group_by state and summarize (and save each as a separate df) to get the following:

State2014     Tuition2014   StateGrants2014
CA            29415         9200
MA            27513         6600
OR            27820         8200

State2015  Tuition2015  
CA         30303        
MA         18250        
OR         23850    

State2016  Tuition2016  
CA         37180        
MA         27990        
OR         24277        

Then I would merge them by state. Here is my code:

years = c(2014,2015,2016)
for (i in seq_along(years)) {
  #grab the variables from a certain year and save as a new df.
  df_year <- df[, grep(paste(years[[i]],"$",sep=""), colnames(df))]

  #Take off the year from each variable name (to make it easier to summarize)
  names(df_year) <- gsub(years[[i]], "", names(df_year), fixed = TRUE)

  df_year <- df_year %>%
    group_by(State) %>%
    summarize(Tuition = mean(Tuition, na.rm = TRUE),
            #this part of the code does not work. In this example, I only want to have this part if the year is 2016.
              if (years[[i]]=='2016')
                {StateGrants = mean(StateGrants, na.rm = TRUE)})

  #rename df_year to df####
  assign(paste("df",years[[i]],sep=''),df_year)
}

I have about 50 years of data and a good number of variables, so I wanted to use a loop. So my question is: how do I add a conditional statement (summarize certain variables conditioned on the year) in the group_by()/summarize() function? Thanks!

Edit: I realize that I could take the if {} out of the summarize() call and do something like:

  if (years[[i]] == 2016) {
    df_year <- df_year %>%
      group_by(State) %>%
      summarize(Tuition = mean(Tuition, na.rm = TRUE),
                StateGrants = mean(StateGrants, na.rm = TRUE))

    #rename df_year to df####
    assign(paste("df", years[[i]], sep = ''), df_year)
  } else {
    df_year <- df_year %>%
      group_by(State) %>%
      summarize(Tuition = mean(Tuition, na.rm = TRUE))

    #rename df_year to df####
    assign(paste("df", years[[i]], sep = ''), df_year)
  }
}

But there are so many combinations of variables that branching like this inside a for loop would not be very efficient or useful.

Recommended answer

This is so much easier with tidy data, so let me show you how to tidy up your data. See http://r4ds.had.co.nz/tidy-data.html.

library(tidyr)
library(dplyr)

df <- gather(df, key, value, -Name) %>% 
  # separate years from the variables
  separate(key, c("var", "year"), sep = -4) %>% 
  # the above line splits up e.g. State2014 into State and 2014.
  # It does so by splitting four characters from the end of the
  # entry (tidyr >= 0.8.2; older releases were off by one and
  # needed sep = -5). Please check that this works for your other
  # variables in case your naming conventions are inconsistent.
  spread(var, value) %>% 
  # turn numbers back to numeric
  mutate_at(c("Tuition", "StateGrants"), as.numeric) %>% 
  gather(var, val, -Name, -year, -State) %>% 
  # group by the variables of interest. Note that `var` here 
  # refers to Tuition and StateGrants. If you have more variables,
  # they will be included here as well. If you want to exclude more
  # variables from being included here in `var`, add more "-colName" 
  # entries in the `gather` statement above
  group_by(year, State, var) %>% 
  # summarize:
  summarise(mean_values = mean(val))

This gives you:

Source: local data frame [18 x 4]
Groups: year, State [?]
    year State         var mean_values
   <chr> <chr>       <chr>       <dbl>
1   2014    CA StateGrants     4600.00
2   2014    CA     Tuition    29415.00
3   2014    MA StateGrants          NA
4   2014    MA     Tuition    27513.33
5   2014    OR StateGrants     4100.00
6   2014    OR     Tuition    27820.00
7   2015    CA StateGrants          NA
8   2015    CA     Tuition    30303.33
9   2015    MA StateGrants          NA
10  2015    MA     Tuition    18250.00
11  2015    OR StateGrants          NA
12  2015    OR     Tuition    23850.00
13  2016    CA StateGrants          NA
14  2016    CA     Tuition    37180.00
15  2016    MA StateGrants          NA
16  2016    MA     Tuition    27990.00
17  2016    OR StateGrants          NA
18  2016    OR     Tuition    24276.67
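
With more recent tidyr releases, the same reshaping can probably be written more directly with pivot_longer(). The following is a minimal sketch and not part of the original answer. It assumes tidyr >= 1.0.0 and dplyr >= 1.0.0, and that every column other than Name is a variable name followed by a four-digit year. The names `wide`, `long` and `by_state` are made up for illustration, and the example data is rebuilt by hand from the question.

```r
library(dplyr)
library(tidyr)

# Example data from the question, rebuilt by hand (assumed layout).
wide <- data.frame(
  Name = c("Jared", "Beth", "Steven", "Lary", "Tom", "Alfred", "Cathy"),
  State2014 = c("CA", "CA", "MA", "MA", "MA", "OR", "OR"),
  State2015 = c("CA", "CA", "MA", "CA", "OR", "OR", "OR"),
  State2016 = c("MA", "CA", "MA", "MA", "OR", "OR", "OR"),
  Tuition2014 = c(22430, 36400, 18010, 24080, 40450, 23570, 32070),
  Tuition2015 = c(23060, 37050, 18250, 30800, 15800, 23680, 32070),
  Tuition2016 = c(40650, 37180, 18720, 24600, 16040, 23750, 33040),
  StateGrants2014 = c(5000, 4200, NA, 6600, NA, 3500, 4700),
  stringsAsFactors = FALSE
)

# One row per Name and year. The ".value" sentinel keeps State, Tuition
# and StateGrants as separate columns, each with its own type.
long <- wide %>%
  pivot_longer(
    cols = -Name,
    names_to = c(".value", "year"),
    names_pattern = "^(\\D+)(\\d{4})$"
  )

# Years without a StateGrants column simply hold NA, so no per-year
# condition is needed inside summarise().
by_state <- long %>%
  group_by(year, State) %>%
  summarise(
    Tuition_Mean = mean(Tuition, na.rm = TRUE),
    Grant_Sum = sum(StateGrants, na.rm = TRUE),
    .groups = "drop"
  )

print(by_state)
```

Because the value columns keep their original types here, the conversion back to numeric is not needed in this variant.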

If you don't like this shape, you can, for example, add %>% spread(var, mean_values) after the summarise statement to get the means for Tuition and StateGrants in separate columns.

If you want to apply different functions to Tuition and Grants (e.g. the mean of Tuition and the sum of grants), you could do the following:

df <- gather(df, key, value, -Name) %>% 
   # sep = -4 splits off the four-digit year (tidyr >= 0.8.2;
   # older releases needed sep = -5)
   separate(key, c("var", "year"), sep = -4) %>% 
   spread(var, value) %>% 
   mutate_at(c("Tuition", "StateGrants"), as.numeric) %>% 
   group_by(year, State) %>% 
   summarise(Grant_Sum = sum(StateGrants, na.rm = TRUE), Tuition_Mean = mean(Tuition))

This gives you:

Source: local data frame [9 x 4]
Groups: year [?]

   year State Grant_Sum Tuition_Mean
  <chr> <chr>     <dbl>        <dbl>
1  2014    CA      9200     29415.00
2  2014    MA      6600     27513.33
3  2014    OR      8200     27820.00
4  2015    CA         0     30303.33
5  2015    MA         0     18250.00
6  2015    OR         0     23850.00
7  2016    CA         0     37180.00
8  2016    MA         0     27990.00
9  2016    OR         0     24276.67

Note that I used sum here with na.rm = TRUE, which returns 0 if all elements are NA. Make sure this makes sense in your use case.
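
If a 0 for "no data at all" would be misleading, one option is a small wrapper that returns NA when every element is missing. This is only a sketch; `sum_or_na` is a hypothetical helper name, not something from dplyr or the original answer.

```r
# Hypothetical helper: like sum(x, na.rm = TRUE), but returns NA
# when there is nothing to add up.
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

sum(c(NA, NA), na.rm = TRUE)   # 0
sum_or_na(c(NA, NA))           # NA
sum_or_na(c(3500, NA, 4700))   # 8200
```

It could then be used inside summarise() in place of sum(), e.g. Grant_Sum = sum_or_na(StateGrants).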

Also, to get the individual data frames you asked for, you can use filter(year == 2014) etc., as in df_2014 <- filter(df, year == 2014).
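
With around 50 years of data, it may be more convenient to keep the per-year data frames in a named list than to create one variable per year. A minimal base R sketch follows; `summary_df` is a hand-typed stand-in for the summarised result (an assumption for illustration), and the `df_` prefix is likewise just a naming choice.

```r
# Stand-in for the summarised result (values copied from the output above).
summary_df <- data.frame(
  year = rep(c("2014", "2015", "2016"), each = 3),
  State = rep(c("CA", "MA", "OR"), times = 3),
  Tuition_Mean = c(29415, 27513.33, 27820,
                   30303.33, 18250, 23850,
                   37180, 27990, 24276.67),
  stringsAsFactors = FALSE
)

# split() returns one data frame per distinct year, named by the year.
by_year <- split(summary_df, summary_df$year)
names(by_year) <- paste0("df_", names(by_year))

by_year$df_2014
```

Each element can then be accessed by name, or the whole list can be processed with lapply().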
