Elegant way to solve ddply task with aggregate (hoping for better performance)

Question

I would like to aggregate a data.frame by an identifier variable called ensg. The data frame looks like this:

  chromosome probeset               ensg symbol    XXA_00    XXA_36    XXB_00
1          X  4938842 ENSMUSG00000000003   Pbsn  4.796123  4.737717  5.326664

I want to compute the mean of each numeric column over rows with the same ensg value. The catch is that I would like to keep the other identifier variables, chromosome and symbol, untouched, since they are also identical for rows with the same ensg.

In the end I would like to have a data.frame with the identifier columns chromosome, ensg and symbol, plus the mean of each numeric column over rows with the same identifier. I implemented this with ddply, but it is very slow compared to aggregate:

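library(plyr)

# For one group of rows: keep the identifier columns of the first row
# and append the column means of the numeric columns.
spec.mean <- function(eset.piece) {
  cbind(eset.piece[1, -numeric.columns],
        t(colMeans(eset.piece[, numeric.columns])))
}

mean.eset <- ddply(eset.consensus.grand, .(ensg), spec.mean, .progress = "tk")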

My first aggregate implementation looks like this,

mean.eset <- aggregate(eset[, numeric.columns], by = list(eset$ensg), FUN = mean, na.rm = TRUE)

and is much faster. But the problem with aggregate is that I have to reattach the describing variables. I have not figured out how to use my custom function with aggregate since aggregate does not pass data frames but only vectors.
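
To make the "reattach" step concrete, here is a minimal sketch of what it could look like in base R. The toy eset, its values, and the use of column names (rather than integer indices) in numeric.columns are assumptions made for illustration; they are not taken from the real data.

```r
# Hypothetical stand-in for the real eset (values are made up)
set.seed(1)
eset <- data.frame(
  chromosome = rep(c("X", "1"), each = 3),
  probeset   = 1:6,
  ensg       = rep(c("ENSG_A", "ENSG_B"), each = 3),
  symbol     = rep(c("SymA", "SymB"), each = 3),
  XXA_00     = rnorm(6),
  XXA_36     = rnorm(6),
  stringsAsFactors = FALSE
)
numeric.columns <- c("XXA_00", "XXA_36")  # assumption: names, not indices

# Step 1: per-ensg means of the numeric columns
means <- aggregate(eset[, numeric.columns], by = list(ensg = eset$ensg),
                   FUN = mean, na.rm = TRUE)

# Step 2: one row of identifier columns per ensg
ids <- unique(eset[, c("chromosome", "ensg", "symbol")])

# Step 3: reattach the identifiers to the means
mean.eset <- merge(ids, means, by = "ensg")
```

This relies on chromosome and symbol being constant within each ensg; if they were not, unique() would return several rows per ensg and the merge would duplicate the means.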

Is there an elegant way to do this with aggregate? Or is there some faster way to do it with ddply?

Recommended answer

First let's define a toy example:

df <- data.frame(chromosome = gl(3,  10,  labels = c('A',  'B',  'C')),
             probeset = gl(3,  10,  labels = c('X',  'Y',  'Z')),
             ensg =  gl(3,  10,  labels = c('E1',  'E2',  'E3')),
             symbol = gl(3,  10,  labels = c('S1',  'S2',  'S3')),
             XXA_00 = rnorm(30),
             XXA_36 = rnorm(30),
             XXB_00 = rnorm(30))

And then we use aggregate with the formula interface:

df1 <- aggregate(cbind(XXA_00, XXA_36, XXB_00) ~ ensg + chromosome + symbol,  
    data = df,  FUN = mean)

> df1
  ensg chromosome symbol      XXA_00      XXA_36      XXB_00
1   E1          A     S1 -0.02533499 -0.06150447 -0.01234508
2   E2          B     S2 -0.25165987  0.02494902 -0.01116426
3   E3          C     S3  0.09454154 -0.48468517 -0.25644569
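
One caveat if the real data contain missing values: the formula interface of aggregate defaults to na.action = na.omit, which drops any row with an NA in any of the variables before the means are computed. That is not the same as the na.rm = TRUE call in the question, which ignores NAs column by column. As a sketch under those assumptions, the by= interface accepts several grouping columns at once, so the identifier columns can be carried along while keeping na.rm = TRUE. The names id.cols, num.cols and df2 are introduced here only for illustration.

```r
# Toy data in the same shape as above, with one missing value added
set.seed(42)
df <- data.frame(chromosome = gl(3, 10, labels = c("A", "B", "C")),
                 ensg       = gl(3, 10, labels = c("E1", "E2", "E3")),
                 symbol     = gl(3, 10, labels = c("S1", "S2", "S3")),
                 XXA_00     = rnorm(30),
                 XXA_36     = rnorm(30))
df$XXA_00[1] <- NA

id.cols  <- c("ensg", "chromosome", "symbol")   # hypothetical helper names
num.cols <- c("XXA_00", "XXA_36")

# Group by all identifier columns; NAs are skipped per column by mean()
df2 <- aggregate(df[num.cols], by = df[id.cols], FUN = mean, na.rm = TRUE)
```

Here the XXA_36 mean for E1 still uses all ten rows, whereas the formula call with its default na.action would use only the nine rows without an NA.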
