获取具有多个分组变量并排除自己的分组值的分组均值 [英] Get group mean with multiple grouping variables and excluding own group value

查看:143
本文介绍了获取具有多个分组变量并排除自己的分组值的分组均值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找一种更快的方法来计算具有多个分组变量的分组平均值,同时排除自己的分组值.一项思想实验将是在同年从同一州的县中找到一个县的平均值(例如价格),而不包括自己县的价值.这是一个玩具数据集.

I'm looking for a faster way to calculate a group mean with multiple grouping variables while excluding own group values. A thought experiment would be finding average value (e.g. price) for a county from the counties in the same state in the same year excluding own county's value. Here's a toy data set.

df <- data_frame(
  state = rep(c("AL", "CA"), each = 6),
  county = rep(letters[1:6], each = 2),
  year = rep(c(2011:2012), 6),
  value = sample.int(100, 12)
)

df %>%
  group_by(state, county, year) %>%
  summarise(q = mean(df$value[df$state == state & df$county != county & df$year == year]))

# Groups:   state, county [6]
   state county  year     q
   <chr> <chr>  <int> <dbl>
 1 AL    a       2011  56  
 2 AL    a       2012  46  
 3 AL    b       2011  50.5
 4 AL    b       2012  52  
 5 AL    c       2011  55.5
 6 AL    c       2012  29  
 7 CA    d       2011  52.5
 8 CA    d       2012  32  
 9 CA    e       2011  68.5
10 CA    e       2012  31.5
11 CA    f       2011  32  
12 CA    f       2012  42.5

上面的代码给了我想要的结果,但是当我将它应用于更大的数据集(具有更多的分组变量)时,它的速度确实变慢了.您对如何加快速度有任何建议吗?

The above code gives me a desired result, but when I apply this to a larger dataset (with more grouping variables) it gets really slow. Do you have any suggestion on how to speed this up?

如果原始方法不正确,请也指出这一点.

If the original approach is incorrect, please point that out as well.

推荐答案

library(dplyr)

df %>%
  group_by(state, year) %>%
  mutate(q = (sum(value) - value) / (n()-1))

#> # A tibble: 12 x 5
#> # Groups:   state, year [4]
#>    state county  year value     q
#>    <chr> <chr>  <int> <int> <dbl>
#>  1 AL    a       2011    68  30.5
#>  2 AL    a       2012    63  42  
#>  3 AL    b       2011    53  38  
#>  4 AL    b       2012    56  45.5
#>  5 AL    c       2011     8  60.5
#>  6 AL    c       2012    28  59.5
#>  7 CA    d       2011     7  40  
#>  8 CA    d       2012    69  41  
#>  9 CA    e       2011    39  24  
#> 10 CA    e       2012    79  36  
#> 11 CA    f       2011    41  23  
#> 12 CA    f       2012     3  74

数据:

Data:

#data_frame is deprecate!
df <- tibble(
  state = rep(c("AL", "CA"), each = 6),
  county = rep(letters[1:6], each = 2),
  year = rep(c(2011:2012), 6),
  value = sample.int(100, 12)
)

这篇关于获取具有多个分组变量并排除自己的分组值的分组均值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆