Compute mean and standard deviation by group for multiple variables in a data.frame
Question
Edit -- This question was originally titled << Long to wide data reshaping in R >>
I'm just learning R and trying to find ways to apply it to help out others in my life. As a test case, I'm working on reshaping some data, and I'm having trouble following the examples I've found online. What I'm starting with looks like this:
ID Obs 1 Obs 2 Obs 3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36
And what I want to end up with will look like this:
ID Obs 1 mean Obs 1 std dev Obs 2 mean Obs 2 std dev
1 x x x x
2 x x x x
3 x x x x
And so forth. What I'm unsure of is whether I need additional information in my long-form data, or what. I imagine that the math part (finding the mean and standard deviations) will be the easy part, but I haven't been able to find a way that seems to work to reshape the data correctly to start in on that process.
Thanks very much for any help.
Recommended answer
This is an aggregation problem, not a reshaping problem as the question originally suggested: we wish to aggregate each column into a mean and standard deviation by ID. There are many packages that handle such problems. In base R it can be done using aggregate like this (assuming DF is the input data frame):
ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))
Note 1: A commenter pointed out that ag is a data frame for which some columns are matrices. Although initially that may seem strange, in fact it simplifies access. ag has the same number of columns as the input DF. Its first column ag[[1]] is ID, and the ith column of the remainder, ag[[i+1]] (or equivalently ag[-1][[i]]), is the matrix of statistics for the ith input observation column. To access the jth statistic of the ith observation column one therefore writes ag[[i+1]][, j], which can also be written as ag[-1][[i]][, j].
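As a small illustration of the access pattern in Note 1, the sketch below rebuilds the question's data inline (so that it runs on its own) and pulls statistics out of the matrix columns by position and by name. The variable names obs1_stats and obs2_sd are made up for this example and are not part of the original answer.

```r
# Rebuild the question's data inline so the block is self-contained.
DF <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Obs_1 = c(43, 27, 36, 33, 29, 32, 25, 45, 38),
  Obs_2 = c(48, 29, 32, 38, 32, 31, 28, 47, 40),
  Obs_3 = c(37, 22, 40, 36, 27, 35, 24, 42, 36)
)

# Same call as in the answer: each observation column becomes a 2-column matrix.
ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))

ncol(ag) == ncol(DF)   # ag has one column per input column
is.matrix(ag[[2]])     # the Obs_1 column holds a matrix of statistics

# i = 1: the whole matrix of statistics for Obs_1 (columns "mean" and "sd").
obs1_stats <- ag[[1 + 1]]

# i = 2: pick one statistic by name instead of by position j = 2.
obs2_sd <- ag[-1][[2]][, "sd"]
```

Because the matrix columns keep the names returned by the summary function, a statistic can be selected by name ("mean", "sd") as well as by its position j.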
On the other hand, suppose there are k statistic columns for each observation column in the input (k=2 in the question). If we flatten the output, then to access the jth statistic of the ith observation column we must use the more complex ag[[k*(i-1)+j+1]] or equivalently ag[-1][[k*(i-1)+j]] (with ag here denoting the flattened data frame).
For example, compare the simplicity of the first expression vs. the second:
ag[-1][[2]]
## mean sd
## [1,] 36.333 10.2144
## [2,] 32.250 4.1932
## [3,] 43.500 4.9497
ag_flat <- do.call("data.frame", ag) # flatten
ag_flat[-1][, 2 * (2-1) + 1:2]
## Obs_2.mean Obs_2.sd
## 1 36.333 10.2144
## 2 32.250 4.1932
## 3 43.500 4.9497
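The index arithmetic for the flattened layout can also be checked mechanically. The following sketch, which rebuilds the data inline so that it is self-contained, compares ag[[i+1]][, j] with column k*(i-1)+j+1 of the flattened frame for every i and j. The names k, n_obs and checks are introduced only for this illustration.

```r
# Rebuild the question's data inline so the block is self-contained.
DF <- data.frame(
  ID    = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
  Obs_1 = c(43, 27, 36, 33, 29, 32, 25, 45, 38),
  Obs_2 = c(48, 29, 32, 38, 32, 31, 28, 47, 40),
  Obs_3 = c(37, 22, 40, 36, 27, 35, 24, 42, 36)
)

ag      <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))
ag_flat <- do.call("data.frame", ag)   # flatten the matrix columns

k     <- 2              # statistics per observation column (mean, sd)
n_obs <- ncol(ag) - 1   # observation columns, i.e. everything except ID

# For every observation column i and statistic j, compare the two access forms.
checks <- sapply(seq_len(n_obs), function(i) {
  sapply(seq_len(k), function(j) {
    isTRUE(all.equal(ag[[i + 1]][, j],
                     ag_flat[[k * (i - 1) + j + 1]],
                     check.attributes = FALSE))
  })
})

all(checks)   # the two forms agree when every entry is TRUE
```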
Note 2: The input in reproducible form is:
Lines <- "ID Obs_1 Obs_2 Obs_3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36"
DF <- read.table(text = Lines, header = TRUE)
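Putting the pieces together, one possible end-to-end sketch that produces the one-row-per-ID table the question asks for is shown below. It repeats the input so that it can be run on its own; the name result and the rounding to two decimals are choices made for this illustration, not part of the original answer.

```r
# Input data from Note 2, repeated so the block runs on its own.
Lines <- "ID Obs_1 Obs_2 Obs_3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36"
DF <- read.table(text = Lines, header = TRUE)

# Aggregate every observation column by ID, then flatten the matrix columns
# into ordinary columns named <column>.<statistic>.
ag     <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))
result <- do.call("data.frame", ag)

# Round the statistics (all columns except ID) for display.
result[-1] <- round(result[-1], 2)

names(result)
result
```

The flattened form is convenient for printing or exporting, while the matrix-column form from Note 1 is easier to index programmatically.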