Average data by group

Problem description

I have a large data frame looking similar to this:

df <- data.frame(dive=factor(sample(c("dive1","dive2"),10,replace=TRUE)),speed=runif(10))
> df
    dive      speed
1  dive1 0.80668490
2  dive1 0.53349584
3  dive2 0.07571784
4  dive2 0.39518628
5  dive1 0.84557955
6  dive1 0.69121443
7  dive1 0.38124950
8  dive2 0.22536126
9  dive1 0.04704750
10 dive2 0.93561651

My goal is to average the values of one column when another column is equal to a certain value, and repeat this for all values. i.e. in the example above I would like to return an average for the column speed for every unique value of the column dive. So when dive==dive1, the average for speed is this and so on for each value of dive.

Solution

There are many ways to do this in R. Specifically: by, aggregate, split, plyr, cast, tapply, data.table, dplyr, and so forth.

Broadly speaking, these problems are of the form split-apply-combine. Hadley Wickham has written a beautiful article that will give you deeper insight into the whole category of problems, and it is well worth reading. His plyr package implements the strategy for general data structures, and dplyr is a newer implementation tuned for performance on data frames. They allow for solving problems of the same form but of even greater complexity than this one. They are well worth learning as a general tool for solving data manipulation problems.

Performance is an issue on very large datasets, and for that it is hard to beat solutions based on data.table. If you only deal with medium-sized datasets or smaller, however, taking the time to learn data.table is likely not worth the effort. dplyr can also be fast, so it is a good choice if you want to speed things up, but don't quite need the scalability of data.table.

Many of the other solutions below do not require any additional packages. Some of them are even fairly fast on medium-large datasets. Their primary disadvantage is either one of metaphor or of flexibility. By metaphor I mean that it is a tool designed for something else being coerced to solve this particular type of problem in a 'clever' way. By flexibility I mean they lack the ability to solve as wide a range of similar problems or to easily produce tidy output.


Examples

With by:

by is in base R. In its most user-friendly form, it takes in vectors and applies a function to them. However, its output is not in a very manipulable form.

res.by <- by( df$speed, df$dive, mean)
res.by
# df$dive: dive1
# [1] 0.5790946
# ---------------------------------------
# df$dive: dive2
# [1] 0.4864489

To get around this, for simple uses of by, the as.data.frame method in the taRifx library works:

library(taRifx)
as.data.frame(res.by)
#    IDX1     value
# 1 dive1 0.6736807
# 2 dive2 0.4051447
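As an aside, a minimal base-R sketch that avoids the extra dependency should also work: unlist() flattens the by-object into a named numeric vector, which can then be rebuilt into a data.frame.

res.vec <- unlist(res.by)
data.frame(dive = names(res.vec), speed = res.vec, row.names = NULL)
#    dive     speed
# 1 dive1 0.5790946
# 2 dive2 0.4864489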

Or aggregate:

aggregate is in base R. It takes in data.frames, outputs data.frames, and uses a formula interface.

aggregate( speed ~ dive, df, mean )
#    dive     speed
# 1 dive1 0.5790946
# 2 dive2 0.4864489
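aggregate will also happily apply a function that returns several statistics at once; a quick sketch (note that the result stores them in a single matrix column, printed as speed.mean and speed.sd):

aggregate( speed ~ dive, df, function(x) c(mean = mean(x), sd = sd(x)) )
# prints one row per dive, with columns speed.mean and speed.sd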

Or split:

split is also in base R. As the name suggests, it performs only the "split" part of the split-apply-combine strategy. To make the rest work, I'll write a small function that uses sapply for apply-combine. sapply automatically simplifies the result as much as possible. In our case, that means a vector rather than a data.frame, since we've got only 1 dimension of results.

splitmean <- function(df) {
  s <- split( df, df$dive)
  sapply( s, function(x) mean(x$speed) )
}
splitmean(df)
#     dive1     dive2 
# 0.5790946 0.4864489 
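For completeness, tapply (mentioned above but not demonstrated) collapses the same split-apply-combine into a single base-R call, and like splitmean it returns a named vector:

tapply(df$speed, df$dive, mean)
#     dive1     dive2 
# 0.5790946 0.4864489 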

Or plyr:

Here's what the official page has to say about plyr:

It’s already possible to do this with base R functions (like split and the apply family of functions), but plyr makes it all a bit easier with:

  • totally consistent names, arguments and outputs
  • convenient parallelisation through the foreach package
  • input from and output to data.frames, matrices and lists
  • progress bars to keep track of long running operations
  • built-in error recovery, and informative error messages
  • labels that are maintained across all transformations

In other words, if you learn one tool for split-apply-combine manipulation it should be plyr.

library(plyr)
res.plyr <- ddply( df, .(dive), function(x) mean(x$speed) )
res.plyr
#    dive        V1
# 1 dive1 0.5790946
# 2 dive2 0.4864489
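If you would rather have a named column than the default V1, plyr's summarise helper can be dropped in:

ddply( df, .(dive), summarise, m = mean(speed) )
#    dive         m
# 1 dive1 0.5790946
# 2 dive2 0.4864489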

Or reshape2:

The reshape2 library is not designed with split-apply-combine as its primary focus. Instead, it uses a two-part melt/cast strategy to perform a wide variety of data reshaping tasks. However, since it allows an aggregation function it can be used for this problem. It would not be my first choice for split-apply-combine operations, but its reshaping capabilities are powerful and thus you should learn this package as well.

library(reshape2)
dcast( melt(df), variable ~ dive, mean)
# Using dive as id variables
#   variable     dive1     dive2
# 1    speed 0.5790946 0.4864489

Or data.table:

library(data.table)
dt <- data.table(df)
setkey(dt,dive)
dt[,mean(speed),by=dive]
#       dive        V1
# [1,] dive1 0.5790946
# [2,] dive2 0.4864489
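In more recent versions of data.table, j also accepts a list (with .() as shorthand), which lets you name the result column and compute several aggregates in one pass; a sketch:

dt[, .(m = mean(speed), n = .N), by = dive]   # .N is the per-group row count
#     dive         m n
# 1: dive1 0.5790946 6
# 2: dive2 0.4864489 4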

Or dplyr:

library(dplyr)
group_by(df, dive) %>% summarize(m = mean(speed))

# dplyr can also work on data tables:
group_by(dt, dive) %>% summarize(m = mean(speed))
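summarize() takes any number of named expressions, so several aggregates come out as columns of one result; for example, n() gives the group size (6 for dive1 and 4 for dive2 in the sample data above):

group_by(df, dive) %>% summarize(m = mean(speed), n = n())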


Benchmarks

10 rows, 2 groups

library(microbenchmark)
m1 <- microbenchmark(
  by( df$speed, df$dive, mean),
  aggregate( speed ~ dive, df, mean ),
  splitmean(df),
  ddply( df, .(dive), function(x) mean(x$speed) ),
  dcast( melt(df), variable ~ dive, mean),
  dt[, mean(speed), by = dive],
  summarize( group_by(df, dive), m = mean(speed) ),
  summarize( group_by(dt, dive), m = mean(speed) )
)

> print(m1, signif = 3)
Unit: microseconds
                                           expr  min   lq   mean median   uq  max neval      cld
                    by(df$speed, df$dive, mean)  302  325  343.9    342  362  396   100  b      
              aggregate(speed ~ dive, df, mean)  904  966 1012.1   1020 1060 1130   100     e   
                                  splitmean(df)  191  206  249.9    220  232 1670   100 a       
  ddply(df, .(dive), function(x) mean(x$speed)) 1220 1310 1358.1   1340 1380 2740   100      f  
         dcast(melt(df), variable ~ dive, mean) 2150 2330 2440.7   2430 2490 4010   100        h
                   dt[, mean(speed), by = dive]  599  629  667.1    659  704  771   100   c     
 summarize(group_by(df, dive), m = mean(speed))  663  710  774.6    744  782 2140   100    d    
 summarize(group_by(dt, dive), m = mean(speed)) 1860 1960 2051.0   2020 2090 3430   100       g 

autoplot(m1)

As usual, data.table has a little more overhead so comes in about average for small datasets. These are microseconds, though, so the differences are trivial. Any of the approaches works fine here, and you should choose based on:

  • What you're already familiar with or want to be familiar with (plyr is always worth learning for its flexibility; data.table is worth learning if you plan to analyze huge datasets; by and aggregate and split are all base R functions and thus universally available)
  • What output it returns (numeric, data.frame, or data.table -- the latter of which inherits from data.frame, as the quick check below shows)
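That inheritance is easy to verify:

class(data.table(df))
# [1] "data.table" "data.frame"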

10 million rows, 10 groups

But what if we have a big dataset? Let's try 10^7 rows split over ten groups.

df <- data.frame(dive=factor(sample(letters[1:10],10^7,replace=TRUE)),speed=runif(10^7))
dt <- data.table(df)
setkey(dt,dive)

m2 <- microbenchmark(
  by( df$speed, df$dive, mean),
  aggregate( speed ~ dive, df, mean ),
  splitmean(df),
  ddply( df, .(dive), function(x) mean(x$speed) ),
  dcast( melt(df), variable ~ dive, mean),
  dt[, mean(speed), by = dive],
  summarize( group_by(df, dive), m = mean(speed) ),
  summarize( group_by(dt, dive), m = mean(speed) ),
  times = 2
)

> print(m2, signif = 3)
Unit: milliseconds
                                           expr   min    lq    mean median    uq   max neval      cld
                    by(df$speed, df$dive, mean)   720   770   799.1    791   816   958   100    d    
              aggregate(speed ~ dive, df, mean) 10900 11000 11027.0  11000 11100 11300   100        h
                                  splitmean(df)   974  1040  1074.1   1060  1100  1280   100     e   
  ddply(df, .(dive), function(x) mean(x$speed))  1050  1080  1110.4   1100  1130  1260   100      f  
         dcast(melt(df), variable ~ dive, mean)  2360  2450  2492.8   2490  2520  2620   100       g 
                   dt[, mean(speed), by = dive]   119   120   126.2    120   122   212   100 a       
 summarize(group_by(df, dive), m = mean(speed))   517   521   531.0    522   532   620   100   c     
 summarize(group_by(dt, dive), m = mean(speed))   154   155   174.0    156   189   321   100  b      

autoplot(m2)

Then data.table or dplyr operating on data.tables is clearly the way to go. Certain approaches (aggregate and dcast) are beginning to look very slow.

10 million rows, 1,000 groups

If you have more groups, the difference becomes more pronounced. With 1,000 groups and the same 10^7 rows:

df <- data.frame(dive=factor(sample(seq(1000),10^7,replace=TRUE)),speed=runif(10^7))
dt <- data.table(df)
setkey(dt,dive)

# then run the same microbenchmark as above
print(m3, signif = 3)
Unit: milliseconds
                                           expr   min    lq    mean median    uq   max neval    cld
                    by(df$speed, df$dive, mean)   776   791   816.2    810   828   925   100  b    
              aggregate(speed ~ dive, df, mean) 11200 11400 11460.2  11400 11500 12000   100      f
                                  splitmean(df)  5940  6450  7562.4   7470  8370 11200   100     e 
  ddply(df, .(dive), function(x) mean(x$speed))  1220  1250  1279.1   1280  1300  1440   100   c   
         dcast(melt(df), variable ~ dive, mean)  2110  2190  2267.8   2250  2290  2750   100    d  
                   dt[, mean(speed), by = dive]   110   111   113.5    111   113   143   100 a     
 summarize(group_by(df, dive), m = mean(speed))   625   630   637.1    633   644   701   100  b    
 summarize(group_by(dt, dive), m = mean(speed))   129   130   137.3    131   142   213   100 a     

autoplot(m3)

So data.table continues scaling well, and dplyr operating on a data.table also works well, with dplyr on a data.frame close to an order of magnitude slower. The split/sapply strategy seems to scale poorly in the number of groups (meaning split() is likely the slow part and sapply the fast part). by continues to be relatively efficient: at 5 seconds it is definitely noticeable to the user, but for a dataset this large still not unreasonable. Still, if you're routinely working with datasets of this size, data.table is clearly the way to go: 100% data.table for the best performance, or dplyr backed by data.table as a viable alternative.
