How do I create a column of means of specific columns in a data.frame?

Problem description

Thanks all for your responses and answers. I can see I've unintentionally left out some important details that may help you understand my problem better. I was trying to keep it simple and generic, but that didn't actually help. Here's an updated version with more information.

I have a data.frame with many columns that came from a NetLogo model generated by BehaviorSpace. Each column is a time series that represents a reported value under different experimental conditions with repetitions represented by the run number and time step number. For example (sorry this is long but I'm trying to give you a flavor for the data):

# Start by building a fake data.frame that models some of the characteristics of mine:
df <- data.frame(run = c(rep(1,5), rep(2,5), rep(3,5), rep(4,5), rep(5,5), rep(6,5), rep(7,5), rep(8,5)))
df2 <- expand.grid(step = 1:5, fac.a = c(10,1000), fac.b = c(0.5,2.0))
df <- data.frame(run = df$run, rep = c(rep(1,20), rep(2,20)), step = df2$step, fac.a = df2$fac.a, fac.b = df2$fac.b)
log_growth <- function (a, b, x) {(1/(1+a*exp(-b*x))) + rnorm(1,0,0.2)}
set.seed(11)
df$treatment1 <- log_growth(df$fac.a, df$fac.b, df$step)
df$treatment2 <- log_growth(df$fac.a / 2, df$fac.b * 2, df$step)

This puts the following into df:

> df
   run rep step fac.a fac.b  treatment1  treatment2
1    1   1    1    10   0.5  0.05288201 0.356176584
2    1   1    2    10   0.5  0.12507561 0.600407158
3    1   1    3    10   0.5  0.22081815 0.804671117
4    1   1    4    10   0.5  0.33627099 0.920093934
5    1   1    5    10   0.5  0.46053940 0.971397427
6    2   1    1  1000   0.5 -0.08700866 0.009396323
7    2   1    2  1000   0.5 -0.08594375 0.018552055
8    2   1    3  1000   0.5 -0.08419297 0.042608835
9    2   1    4  1000   0.5 -0.08131981 0.102435481
10   2   1    5  1000   0.5 -0.07661880 0.232875872
11   3   1    1    10   2.0  0.33627099 0.920093934
12   3   1    2    10   2.0  0.75654214 1.002314651
13   3   1    3    10   2.0  0.88715737 1.003958435
14   3   1    4    10   2.0  0.90800192 1.003988593
15   3   1    5    10   2.0  0.91089154 1.003989145
16   4   1    1  1000   2.0 -0.08131981 0.102435481
17   4   1    2  1000   2.0 -0.03688314 0.860350536
18   4   1    3  1000   2.0  0.19880473 1.000926458
19   4   1    4  1000   2.0  0.66014952 1.003932891
20   4   1    5  1000   2.0  0.86791705 1.003988125
21   5   2    1    10   0.5  0.05288201 0.356176584
22   5   2    2    10   0.5  0.12507561 0.600407158
23   5   2    3    10   0.5  0.22081815 0.804671117
24   5   2    4    10   0.5  0.33627099 0.920093934
25   5   2    5    10   0.5  0.46053940 0.971397427
26   6   2    1  1000   0.5 -0.08700866 0.009396323
27   6   2    2  1000   0.5 -0.08594375 0.018552055
28   6   2    3  1000   0.5 -0.08419297 0.042608835
29   6   2    4  1000   0.5 -0.08131981 0.102435481
30   6   2    5  1000   0.5 -0.07661880 0.232875872
31   7   2    1    10   2.0  0.33627099 0.920093934
32   7   2    2    10   2.0  0.75654214 1.002314651
33   7   2    3    10   2.0  0.88715737 1.003958435
34   7   2    4    10   2.0  0.90800192 1.003988593
35   7   2    5    10   2.0  0.91089154 1.003989145
36   8   2    1  1000   2.0 -0.08131981 0.102435481
37   8   2    2  1000   2.0 -0.03688314 0.860350536
38   8   2    3  1000   2.0  0.19880473 1.000926458
39   8   2    4  1000   2.0  0.66014952 1.003932891
40   8   2    5  1000   2.0  0.86791705 1.003988125

So what I did before is split up the data frame using by and wanted to obtain averages and standard deviations for every step (it's a time series) and each combination of factors.

After having looked at all your answers and having reconsidered my problem, I think what I'm trying to do would be better handled during the conversion process of by. I'm not exactly sure how to do that... What I want the output to look like is a summary of sorts:

> df
   run fac.a fac.b  mean.treatment1  mean.treatment2 sd.treatment1 sd.treatment2
1    1    10   0.5        xxxxxxxxx       xxxxxxxxxx    xxxxxxxxxx   xxxxxxxxxxx
1    1    10   2.0        xxxxxxxxx       xxxxxxxxxx    xxxxxxxxxx   xxxxxxxxxxx
1    1  1000   0.5        xxxxxxxxx       xxxxxxxxxx    xxxxxxxxxx   xxxxxxxxxxx
1    1  1000   2.0        xxxxxxxxx       xxxxxxxxxx    xxxxxxxxxx   xxxxxxxxxxx

Is this a job for aggregate? Thanks for your patience and help. -- Glenn

Original question:

I have a data.frame with many columns, each of which represents a specific experimental condition with repetitions.

> df <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5), mean.a = 0, mean.b = 0, sd.a = 0, sd.b = 0)
> df
        a.1       b.1       a.2       b.2 mean.a mean.b   sd.a   sd.b
1 0.9209433 0.3501444 0.3893140 0.3264827      0      0      0      0
2 0.4171254 0.4883140 0.8282384 0.1215129      0      0      0      0
3 0.2291582 0.9419946 0.4089008 0.5665242      0      0      0      0
4 0.3807868 0.1889066 0.8271075 0.4022014      0      0      0      0
5 0.5863078 0.4991847 0.4082745 0.5637367      0      0      0      0

I want to find means and standard deviations for each condition and repetition. So far the most direct way seems to be:

for (i in c("a.1", "a.2")) {df$mean.a <- df$mean.a + df[[i]]}
df$mean.a <- df$mean.a / 2

But I have a lot of columns, and they are all over the place, so this seems really labor intensive and manual. A little nicer method is to use ave():

df$mean.a <- with (df, ave(a.1, a.2))

But if I want to do sd() instead, I mysteriously get NAs:

df$sd.a <- with (df, ave(a.1, a.2, FUN = sd))
> df
        a.1       b.1       a.2       b.2    mean.a mean.b   sd.a   sd.b
1 0.9209433 0.3501444 0.3893140 0.3264827 0.9209433      0     NA      0
2 0.4171254 0.4883140 0.8282384 0.1215129 0.4171254      0     NA      0
3 0.2291582 0.9419946 0.4089008 0.5665242 0.2291582      0     NA      0
4 0.3807868 0.1889066 0.8271075 0.4022014 0.3807868      0     NA      0
5 0.5863078 0.4991847 0.4082745 0.5637367 0.5863078      0     NA      0

I would prefer not to use external packages if possible, but it seems like I'm missing something basic. This question was similar, but had to do with data.tables, not data.frames.

Another was even closer, but with ave() it is tedious to specify, for instance, columns 1-12, 15-17, and 26 as the subject columns, and, mysteriously, sd() produces those NAs. It seems like there should be a straightforward way to do this. Almost makes me wish for Excel. :-)

Recommended answer

Let us first bring your data into a long format that is easier to summarize. Note that this solution does, against your initial requirement, rely on external libraries, but they are very common today and real time-savers (plyr and reshape2, both by Hadley Wickham, who is a phenomenon in the R community).

# Note that only the data columns are used here; the data frame has no mean or sd columns at this stage.
df <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5))

df$repetition = c(1:nrow(df))
library(reshape2)
tmp = melt(df, id.vars = "repetition")
names(tmp)[2] = "condition"

tmp$treatment = substring(tmp$condition,1,1)

This produces:

> head(tmp)
  repetition condition     value treatment
1          1       a.1 0.6668952         a
2          2       a.1 0.1248151         a
3          3       a.1 0.7082199         a
4          4       a.1 0.9840956         a
5          5       a.1 0.4479190         a
6          1       b.1 0.9381539         b

Now the rest is easy. We rely on the popular plyr package:

library(plyr)
results = ddply(tmp, .(repetition, treatment), summarize, mean = mean(value), sd = sd(value) )

The final result is:

> head(results)
  repetition treatment      mean         sd
1          1         a 0.6777342 0.01532853
2          1         b 0.6734955 0.37428353
3          2         a 0.4533126 0.46456561
4          2         b 0.8441925 0.07260509
5          3         a 0.3967338 0.44050779
6          3         b 0.5886821 0.42635902

Hope this is what you wanted.
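If external packages are to be avoided after all, the same per-row statistics can probably be obtained in base R. The sketch below is not part of the original answer; the names wide, a_cols, b_cols and row_stats and the seed are assumptions made for illustration. It also shows why ave() appeared to misbehave in the question: every argument after the first is treated by ave() as a grouping variable, so ave(a.1, a.2, FUN = sd) groups a.1 by the values of a.2. Each group then holds a single value, whose mean is the value itself and whose sd is NA.

```r
# Base-R sketch: row-wise mean and sd per condition, no external packages.
set.seed(1)  # arbitrary seed, only to make the example reproducible
wide <- data.frame(a.1 = runif(5), b.1 = runif(5), a.2 = runif(5), b.2 = runif(5))

# Pick the columns of each condition by name pattern rather than by position.
a_cols <- grep("^a\\.", names(wide), value = TRUE)
b_cols <- grep("^b\\.", names(wide), value = TRUE)

row_stats <- data.frame(
  mean.a = rowMeans(wide[a_cols]),       # mean across a.1, a.2 for each row
  mean.b = rowMeans(wide[b_cols]),
  sd.a   = apply(wide[a_cols], 1, sd),   # sd across a.1, a.2 for each row
  sd.b   = apply(wide[b_cols], 1, sd)
)
print(cbind(wide, row_stats))

# Why ave() gave NA: a.2 is used as a grouping variable, so each group has one value.
print(with(wide, ave(a.1, a.2, FUN = sd)))
```

Selecting columns with grep() scales to many scattered columns, as long as the condition is encoded in the column name.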

One more addition: if you do not want to differentiate by repetition, but only want a summary at the treatment level, drop repetition from the grouping variables:

# addition
results = ddply(tmp, .( treatment), summarize, mean = mean(value), sd = sd(value) )

And the result:

> head(results)
  treatment      mean        sd
1         a 0.5817867 0.2954151
2         b 0.6212537 0.3219035
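For the updated data in the question, base R's aggregate() may indeed be enough. The following is a minimal sketch under stated assumptions, not a tested solution for the real BehaviorSpace output: the data are rebuilt under the hypothetical name sim, and the noise term draws one value per row (the question's log_growth draws a single value per call, which would make the repetitions identical and every sd zero).

```r
# Rebuild a small data set resembling the question's (sim is a hypothetical name).
set.seed(11)
sim <- expand.grid(step = 1:5, fac.a = c(10, 1000), fac.b = c(0.5, 2.0), rep = 1:2)

# Assumption: one noise draw per observation, unlike the question's rnorm(1, ...).
log_growth <- function(a, b, x) 1 / (1 + a * exp(-b * x)) + rnorm(length(x), 0, 0.2)
sim$treatment1 <- log_growth(sim$fac.a, sim$fac.b, sim$step)
sim$treatment2 <- log_growth(sim$fac.a / 2, sim$fac.b * 2, sim$step)

# Mean and sd over repetitions for every factor combination and time step.
means <- aggregate(cbind(treatment1, treatment2) ~ fac.a + fac.b + step,
                   data = sim, FUN = mean)
sds   <- aggregate(cbind(treatment1, treatment2) ~ fac.a + fac.b + step,
                   data = sim, FUN = sd)
names(means)[4:5] <- c("mean.treatment1", "mean.treatment2")
names(sds)[4:5]   <- c("sd.treatment1", "sd.treatment2")

# Join the two summaries on the grouping columns.
summary_df <- merge(means, sds, by = c("fac.a", "fac.b", "step"))
print(head(summary_df))
```

Removing step from both formulas would give one row per combination of fac.a and fac.b, which is closer to the layout sketched in the question.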
