Elegant way to solve ddply task with aggregate (hoping for better performance)
Question
I would like to aggregate a data.frame by an identifier variable called ensg. The data frame looks like this:
chromosome probeset ensg symbol XXA_00 XXA_36 XXB_00
1 X 4938842 ENSMUSG00000000003 Pbsn 4.796123 4.737717 5.326664
I want to compute the mean of each numeric column over rows with the same ensg value. The problem is that I would like to leave the other identity variables, chromosome and symbol, untouched, as they are also identical for rows with the same ensg.
In the end I would like to have a data.frame with the identity columns chromosome, ensg and symbol, plus the mean of the numeric columns over rows with the same identifier. I implemented this with ddply, but it is very slow compared to aggregate:
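library(plyr)

# numeric.columns: indices of the numeric columns (assumed to be defined elsewhere)
spec.mean <- function(eset.piece)
{
  # keep the identifier columns of the first row and append the column means
  cbind(eset.piece[1, -numeric.columns], t(colMeans(eset.piece[, numeric.columns])))
}
mean.eset <- ddply(eset.consensus.grand, .(ensg), spec.mean, .progress = "tk")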
My first aggregate implementation looks like this,
mean.eset <- aggregate(eset[, numeric.columns], by = list(eset$ensg), FUN = mean, na.rm = TRUE)
and is much faster. But the problem with aggregate is that I have to reattach the describing variables. I have not figured out how to use my custom function with aggregate, since aggregate does not pass data frames to the function, only vectors.
Is there an elegant way to do this with aggregate? Or is there some faster way to do it with ddply?
Recommended answer
First let's define a toy example:
df <- data.frame(chromosome = gl(3, 10, labels = c('A', 'B', 'C')),
probeset = gl(3, 10, labels = c('X', 'Y', 'Z')),
ensg = gl(3, 10, labels = c('E1', 'E2', 'E3')),
symbol = gl(3, 10, labels = c('S1', 'S2', 'S3')),
XXA_00 = rnorm(30),
XXA_36 = rnorm(30),
XXB_00 = rnorm(30))
And then we use aggregate with the formula interface:
df1 <- aggregate(cbind(XXA_00, XXA_36, XXB_00) ~ ensg + chromosome + symbol,
data = df, FUN = mean)
> df1
ensg chromosome symbol XXA_00 XXA_36 XXB_00
1 E1 A S1 -0.02533499 -0.06150447 -0.01234508
2 E2 B S2 -0.25165987 0.02494902 -0.01116426
3 E3 C S3 0.09454154 -0.48468517 -0.25644569
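When there are many numeric columns, listing each of them inside cbind() gets tedious. The following is a minimal sketch of two variants, not a benchmarked solution: one uses "." on the left-hand side of the formula, the other aggregates by ensg alone and reattaches the describing variables with merge(). The names id.columns, df2, df3, means and ids are made up for illustration, and numeric.columns is assumed here to hold column names rather than the indices used in the question. The seed is fixed only so that the two results can be compared.

```r
# Toy data in the same shape as the example above (seed fixed for reproducibility)
set.seed(1)
df <- data.frame(chromosome = gl(3, 10, labels = c("A", "B", "C")),
                 probeset   = gl(3, 10, labels = c("X", "Y", "Z")),
                 ensg       = gl(3, 10, labels = c("E1", "E2", "E3")),
                 symbol     = gl(3, 10, labels = c("S1", "S2", "S3")),
                 XXA_00 = rnorm(30),
                 XXA_36 = rnorm(30),
                 XXB_00 = rnorm(30))

# Hypothetical helper vectors: identifier columns and numeric columns by name
id.columns <- c("ensg", "chromosome", "symbol")
numeric.columns <- names(df)[sapply(df, is.numeric)]

# Variant 1: "." stands for every column not named on the right-hand side,
# so the non-numeric probeset column is dropped before aggregating.
df2 <- aggregate(. ~ ensg + chromosome + symbol,
                 data = df[, c(id.columns, numeric.columns)],
                 FUN = mean)

# Variant 2: aggregate by ensg only, then reattach the describing variables.
means <- aggregate(df[, numeric.columns], by = list(ensg = df$ensg),
                   FUN = mean, na.rm = TRUE)
ids <- unique(df[, id.columns])   # one row per ensg, assuming ids are constant within ensg
df3 <- merge(ids, means, by = "ensg")
```

Both df2 and df3 should contain one row per ensg with the same means. One difference to keep in mind: the formula interface uses na.action = na.omit by default, which drops every row containing an NA before the means are computed, whereas na.rm = TRUE in variant 2 skips missing values column by column. Variant 2 also relies on chromosome and symbol being constant within each ensg, as stated in the question.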