Calculate the mean by group


Question

I have a large data frame looking similar to this:

df <- data.frame(dive=factor(sample(c("dive1","dive2"),10,replace=TRUE)),speed=runif(10))
> df
    dive      speed
1  dive1 0.80668490
2  dive1 0.53349584
3  dive2 0.07571784
4  dive2 0.39518628
5  dive1 0.84557955
6  dive1 0.69121443
7  dive1 0.38124950
8  dive2 0.22536126
9  dive1 0.04704750
10 dive2 0.93561651

My goal is to average the values of one column when another column is equal to a certain value and repeat this for all values. i.e. in the example above I would like to return an average for the column speed for every unique value of the column dive. So when dive==dive1, the average for speed is this and so on for each value of dive.

Answer

There are many ways to do this in R. Specifically: by, aggregate, split, plyr, cast, tapply, data.table, dplyr, and so forth.

Broadly speaking, these problems are of the form split-apply-combine. Hadley Wickham has written a beautiful article ("The Split-Apply-Combine Strategy for Data Analysis") that will give you deeper insight into the whole category of problems, and it is well worth reading. His plyr package implements the strategy for general data structures, and dplyr is a newer implementation tuned for performance on data frames. They allow for solving problems of the same form but of even greater complexity than this one. They are well worth learning as a general tool for solving data manipulation problems.

Performance is an issue on very large datasets, and for that it is hard to beat solutions based on data.table. If you only deal with medium-sized datasets or smaller, however, taking the time to learn data.table is likely not worth the effort. dplyr can also be fast, so it is a good choice if you want to speed things up, but don't quite need the scalability of data.table.

Many of the other solutions below do not require any additional packages. Some of them are even fairly fast on medium-large datasets. Their primary disadvantage is either one of metaphor or of flexibility. By metaphor I mean that it is a tool designed for something else being coerced to solve this particular type of problem in a 'clever' way. By flexibility I mean they lack the ability to solve as wide a range of similar problems or to easily produce tidy output.

tapply:

tapply(df$speed, df$dive, mean)
#     dive1     dive2 
# 0.5419921 0.5103974
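
tapply returns a named vector (strictly, a 1-d array). If you want a data.frame instead, a minimal base-R sketch of the conversion (the mean_speed column name is illustrative):

res <- tapply(df$speed, df$dive, mean)
data.frame(dive = names(res), mean_speed = as.numeric(res))
#    dive mean_speed
# 1 dive1  0.5419921
# 2 dive2  0.5103974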

aggregate:

aggregate takes in data.frames, outputs data.frames, and uses a formula interface.

aggregate( speed ~ dive, df, mean )
#    dive     speed
# 1 dive1 0.5790946
# 2 dive2 0.4864489
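
aggregate also has a second, non-formula interface in which the grouping is passed as a named list; a sketch of the equivalent call (same result as above):

aggregate(df["speed"], by = list(dive = df$dive), FUN = mean)
#    dive     speed
# 1 dive1 0.5790946
# 2 dive2 0.4864489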

by:

In its most user-friendly form, it takes in vectors and applies a function to them. However, its output is not in a very manipulable form:

res.by <- by(df$speed, df$dive, mean)
res.by
# df$dive: dive1
# [1] 0.5790946
# ---------------------------------------
# df$dive: dive2
# [1] 0.4864489

To get around this, for simple uses of by, the as.data.frame method in the taRifx library works:

library(taRifx)
as.data.frame(res.by)
#    IDX1     value
# 1 dive1 0.6736807
# 2 dive2 0.4051447
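
If you prefer to avoid the taRifx dependency, the coercion can also be done in base R; a minimal sketch, assuming each group's result is a single number (rows follow the factor levels of dive):

data.frame(dive = levels(df$dive), speed = as.numeric(res.by))
#    dive     speed
# 1 dive1 0.5790946
# 2 dive2 0.4864489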

split:

As the name suggests, it performs only the "split" part of the split-apply-combine strategy. To make the rest work, I'll write a small function that uses sapply for apply-combine. sapply automatically simplifies the result as much as possible. In our case, that means a vector rather than a data.frame, since we've got only 1 dimension of results.

splitmean <- function(df) {
  s <- split( df, df$dive)
  sapply( s, function(x) mean(x$speed) )
}
splitmean(df)
#     dive1     dive2 
# 0.5790946 0.4864489 
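
A slightly more defensive variant of the same idea replaces sapply with vapply, which guarantees a numeric result (or an error) instead of silently simplifying; a minimal sketch:

splitmean2 <- function(df) {
  s <- split(df$speed, df$dive)   # split the vector directly
  vapply(s, mean, numeric(1))     # each group must return exactly one numeric
}
splitmean2(df)
#     dive1     dive2 
# 0.5790946 0.4864489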


External packages

data.table:

library(data.table)
setDT(df)[ , .(mean_speed = mean(speed)), by = dive]   # setDT converts df to a data.table in place
#    dive mean_speed
# 1: dive1  0.5419921
# 2: dive2  0.5103974
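
A nice property of the data.table syntax is that additional per-group summaries cost nothing extra to express; a sketch using the built-in .N row counter (the column names are illustrative; counts match the 10-row sample above):

setDT(df)[ , .(mean_speed = mean(speed), n = .N), by = dive]
#     dive mean_speed n
# 1: dive1  0.5419921 6
# 2: dive2  0.5103974 4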

dplyr:

library(dplyr)
group_by(df, dive) %>% summarize(m = mean(speed))
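
The same call is often written pipe-first; summarize returns a data frame (a tbl_df in current dplyr) with one row per group and the columns you name:

df %>%
  group_by(dive) %>%
  summarize(mean_speed = mean(speed))
# one row per level of dive, columns: dive, mean_speed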

plyr (the precursor of dplyr):

Here is what the official page has to say about plyr:

It’s already possible to do this with base R functions (like split and the apply family of functions), but plyr makes it all a bit easier with:

  • totally consistent names, arguments and outputs
  • convenient parallelisation through the foreach package
  • input from and output to data.frames, matrices and lists
  • progress bars to keep track of long running operations
  • built-in error recovery, and informative error messages
  • labels that are maintained across all transformations

In other words, if you learn one tool for split-apply-combine manipulation it should be plyr.

library(plyr)
res.plyr <- ddply( df, .(dive), function(x) mean(x$speed) )
res.plyr
#    dive        V1
# 1 dive1 0.5790946
# 2 dive2 0.4864489
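
ddply can also take plyr's summarise helper, which lets you name the output column instead of getting the default V1; a sketch (namespaced to avoid a clash with dplyr::summarise):

res.plyr2 <- ddply(df, .(dive), plyr::summarise, mean_speed = mean(speed))
res.plyr2
#    dive mean_speed
# 1 dive1  0.5790946
# 2 dive2  0.4864489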

reshape2:

The reshape2 library is not designed with split-apply-combine as its primary focus. Instead, it uses a two-part melt/cast strategy to perform a wide variety of data reshaping tasks. However, since it allows an aggregation function it can be used for this problem. It would not be my first choice for split-apply-combine operations, but its reshaping capabilities are powerful and thus you should learn this package as well.

library(reshape2)
dcast( melt(df), variable ~ dive, mean)
# Using dive as id variables
#   variable     dive1     dive2
# 1    speed 0.5790946 0.4864489
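
The "Using dive as id variables" message comes from melt guessing the id column; passing id.vars explicitly makes the call deterministic and silent:

dcast(melt(df, id.vars = "dive"), variable ~ dive, mean)
#   variable     dive1     dive2
# 1    speed 0.5790946 0.4864489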


Benchmarks

10 rows, 2 groups

library(microbenchmark)
library(plyr)        # load plyr before dplyr so plyr does not mask dplyr::summarize
library(dplyr)
library(reshape2)
library(data.table)

dt <- data.table(df)  # data.table copy of df used in the benchmarks (implied setup)
setkey(dt, dive)

m1 <- microbenchmark(
  by( df$speed, df$dive, mean),
  aggregate( speed ~ dive, df, mean ),
  splitmean(df),
  ddply( df, .(dive), function(x) mean(x$speed) ),
  dcast( melt(df), variable ~ dive, mean),
  dt[, mean(speed), by = dive],
  summarize( group_by(df, dive), m = mean(speed) ),
  summarize( group_by(dt, dive), m = mean(speed) )
)

> print(m1, signif = 3)
Unit: microseconds
                                           expr  min   lq   mean median   uq  max neval      cld
                    by(df$speed, df$dive, mean)  302  325  343.9    342  362  396   100  b      
              aggregate(speed ~ dive, df, mean)  904  966 1012.1   1020 1060 1130   100     e   
                                  splitmean(df)  191  206  249.9    220  232 1670   100 a       
  ddply(df, .(dive), function(x) mean(x$speed)) 1220 1310 1358.1   1340 1380 2740   100      f  
         dcast(melt(df), variable ~ dive, mean) 2150 2330 2440.7   2430 2490 4010   100        h
                   dt[, mean(speed), by = dive]  599  629  667.1    659  704  771   100   c     
 summarize(group_by(df, dive), m = mean(speed))  663  710  774.6    744  782 2140   100    d    
 summarize(group_by(dt, dive), m = mean(speed)) 1860 1960 2051.0   2020 2090 3430   100       g 

autoplot(m1)   # microbenchmark's autoplot method; requires ggplot2 to be loaded

As usual, data.table has a little more overhead so comes in about average for small datasets. These are microseconds, though, so the differences are trivial. Any of the approaches works fine here, and you should choose based on:

  • What you're already familiar with or want to be familiar with (plyr is always worth learning for its flexibility; data.table is worth learning if you plan to analyze huge datasets; by and aggregate and split are all base R functions and thus universally available)
  • What output it returns (numeric, data.frame, or data.table -- the latter of which inherits from data.frame)

But what if we have a big dataset? Let's try 10^7 rows split over ten groups.

df <- data.frame(dive=factor(sample(letters[1:10],10^7,replace=TRUE)),speed=runif(10^7))
dt <- data.table(df)
setkey(dt,dive)

m2 <- microbenchmark(
  by( df$speed, df$dive, mean),
  aggregate( speed ~ dive, df, mean ),
  splitmean(df),
  ddply( df, .(dive), function(x) mean(x$speed) ),
  dcast( melt(df), variable ~ dive, mean),
  dt[, mean(speed), by = dive],
  summarize( group_by(df, dive), m = mean(speed) ),
  summarize( group_by(dt, dive), m = mean(speed) ),
  times = 2
)

> print(m2, signif = 3)
Unit: milliseconds
                                           expr   min    lq    mean median    uq   max neval      cld
                    by(df$speed, df$dive, mean)   720   770   799.1    791   816   958   100    d    
              aggregate(speed ~ dive, df, mean) 10900 11000 11027.0  11000 11100 11300   100        h
                                  splitmean(df)   974  1040  1074.1   1060  1100  1280   100     e   
  ddply(df, .(dive), function(x) mean(x$speed))  1050  1080  1110.4   1100  1130  1260   100      f  
         dcast(melt(df), variable ~ dive, mean)  2360  2450  2492.8   2490  2520  2620   100       g 
                   dt[, mean(speed), by = dive]   119   120   126.2    120   122   212   100 a       
 summarize(group_by(df, dive), m = mean(speed))   517   521   531.0    522   532   620   100   c     
 summarize(group_by(dt, dive), m = mean(speed))   154   155   174.0    156   189   321   100  b      

autoplot(m2)

Then data.table or dplyr operating on data.tables is clearly the way to go. Certain approaches (aggregate and dcast) are beginning to look very slow.

If you have more groups, the difference becomes more pronounced. With 1,000 groups and the same 10^7 rows:

df <- data.frame(dive=factor(sample(seq(1000),10^7,replace=TRUE)),speed=runif(10^7))
dt <- data.table(df)
setkey(dt,dive)

# then run the same microbenchmark as above
> print(m3, signif = 3)
Unit: milliseconds
                                           expr   min    lq    mean median    uq   max neval    cld
                    by(df$speed, df$dive, mean)   776   791   816.2    810   828   925   100  b    
              aggregate(speed ~ dive, df, mean) 11200 11400 11460.2  11400 11500 12000   100      f
                                  splitmean(df)  5940  6450  7562.4   7470  8370 11200   100     e 
  ddply(df, .(dive), function(x) mean(x$speed))  1220  1250  1279.1   1280  1300  1440   100   c   
         dcast(melt(df), variable ~ dive, mean)  2110  2190  2267.8   2250  2290  2750   100    d  
                   dt[, mean(speed), by = dive]   110   111   113.5    111   113   143   100 a     
 summarize(group_by(df, dive), m = mean(speed))   625   630   637.1    633   644   701   100  b    
 summarize(group_by(dt, dive), m = mean(speed))   129   130   137.3    131   142   213   100 a     

autoplot(m3)

So data.table continues scaling well, and dplyr operating on a data.table also works well, with dplyr on a data.frame close to an order of magnitude slower. The split/sapply strategy seems to scale poorly in the number of groups (meaning the split() is likely slow and the sapply is fast). by continues to be relatively efficient: at 5 seconds, it's definitely noticeable to the user, but for a dataset this large still not unreasonable. Still, if you're routinely working with datasets of this size, data.table is clearly the way to go: 100% data.table for the best performance, or dplyr using data.table as a viable alternative.
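
For completeness, a sketch of that last recommendation: the dplyr verbs run unchanged whether the underlying object is a data.frame or a data.table, so switching backends is a one-line change (dt here is the data.table copy created in the setup above):

library(dplyr)
library(data.table)

dt %>% group_by(dive) %>% summarize(m = mean(speed))
# same semantics as the data.frame version; the m2/m3 timings above show it
# running several times faster than the identical call on a plain data.frame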

