Produce a precision weighted average among rows with repeated observations
Question
I have a dataframe similar to the one generated below. Some individuals have more than one observation for a particular variable, and each variable has an associated standard error (SE) for the estimate. I would like to create a new dataframe that contains only a single row for each individual. For individuals with more than one observation, such as Kim or Bob, I need to calculate a precision weighted average based on the standard errors of the estimates, along with a variance for the newly calculated weighted mean. For example, for Bob, for var1, this means that I would want his var1 value in the new dataframe to be:
weighted.mean(c(example$var1[2], example$var1[10]),
              c(1/example$SE1[2], 1/example$SE1[10]))
and for Bob's new SE1, which would be the variance of the weighted mean, to be:
1/sum(1/example$SE1[2] + 1/example$SE1[10])
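As a quick numerical sanity check (not part of the original question), the two formulas above can be evaluated by hand in Python, using Bob's two var1 observations from the example data shown below:

```python
# Bob's two var1 observations and their standard errors,
# taken from rows 2 and 10 of the example data frame below.
values = [8.753078, 9.004425]
ses = [2.174308, 1.592312]

# Precision weights: the reciprocal of each standard error.
weights = [1 / se for se in ses]

# Precision-weighted mean: sum(w_i * x_i) / sum(w_i).
new_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Variance of the weighted mean: 1 / sum(1 / SE_i).
new_se = 1 / sum(weights)

print(round(new_mean, 4), round(new_se, 5))  # 8.8982 0.91917
```

These match the values in the accepted answer's output below.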
I have tried using the aggregate function and am able to calculate the arithmetic mean of the values, but the simple function I wrote does not use the standard errors, nor can it deal with the NAs.
aggregate(example[,1:4], by = list(example[,5]), mean)
Would appreciate any help in developing some code to work through this problem. Here is the example dataset.
set.seed(1562)
example = data.frame(rnorm(10, 8, 2))
colnames(example)[1] = "var1"
example$SE1 = rnorm(10, 2, 1)
example$var2 = rnorm(10, 8, 2)
example$SE2 = rnorm(10, 2, 1)
example$id = c("Kim", "Bob", "Joe", "Sam", "Kim", "Kim", "Joe", "Sara", "Jeff", "Bob")
example$SE1[5] = NA
example$var1[5] = NA
example$SE2[10] = NA
example$var2[10] = NA
example
var1 SE1 var2 SE2 id
1 9.777769 2.451406 6.363250 2.2739566 Kim
2 8.753078 2.174308 6.219770 1.4978380 Bob
3 7.977356 2.107739 6.835998 2.1647437 Joe
4 11.113048 2.713242 11.091650 1.7018666 Sam
5 NA NA 11.769884 -0.1310218 Kim
6 5.271308 1.831475 6.818854 3.0294338 Kim
7 7.770062 2.094850 6.387607 0.2272348 Joe
8 9.837612 1.956486 8.517445 3.5126378 Sara
9 4.637518 2.516896 7.173460 2.0292454 Jeff
10 9.004425 1.592312 NA NA Bob
Solution

I like the plyr package for these sorts of problems. It should be functionally equivalent to aggregate, but I think it is nice and convenient to use. There are lots of examples and a great ~20 page intro to plyr on the website. For this problem, since the data starts as a data.frame and you want another data.frame on the other end, we use ddply():
library(plyr)
# f1()
ddply(example, "id", summarize,
      newMean = weighted.mean(x = var1, 1/SE1, na.rm = TRUE),
      newSE = 1/sum(1/SE1, na.rm = TRUE)
)
Which returns:
    id  newMean   newSE
1  Bob   8.8982 0.91917
2 Jeff   4.6375 2.51690
3  Joe   7.8734 1.05064
4  Kim   7.1984 1.04829
5  Sam  11.1130 2.71324
6 Sara   9.8376 1.95649
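To double-check the NA handling, here is a small Python sketch (not part of the original answer) that reproduces Kim's row from the output above, dropping the NA observation the same way na.rm = TRUE does:

```python
import math

# Kim's three var1 observations from the example data; row 5 is NA.
values = [9.777769, float("nan"), 5.271308]
ses = [2.451406, float("nan"), 1.831475]

# Drop pairs where either the value or its SE is NaN (mirrors na.rm = TRUE).
pairs = [(x, se) for x, se in zip(values, ses)
         if not (math.isnan(x) or math.isnan(se))]

weights = [1 / se for _, se in pairs]
new_mean = sum(x / se for x, se in pairs) / sum(weights)
new_se = 1 / sum(weights)

print(round(new_mean, 4), round(new_se, 5))  # 7.1984 1.04829
```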
Also check out ?summarize and ?transform for some other good background. You can also pass an anonymous function to the plyr functions if necessary for more complicated tasks.
Or use the data.table package, which can prove faster for some tasks:

library(data.table)
dt <- data.table(example, key = "id")
# f2()
dt[, list(newMean = weighted.mean(var1, 1/SE1, na.rm = TRUE),
          newSE = 1/sum(1/SE1, na.rm = TRUE)),
   by = "id"]
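For readers outside R, the grouped computation that ddply() and data.table perform here can be sketched in plain Python with a dictionary keyed by id (a rough equivalent for illustration, not the original code):

```python
import math
from collections import defaultdict

# (id, var1, SE1) rows from the example data frame; NA becomes math.nan.
rows = [
    ("Kim", 9.777769, 2.451406), ("Bob", 8.753078, 2.174308),
    ("Joe", 7.977356, 2.107739), ("Sam", 11.113048, 2.713242),
    ("Kim", math.nan, math.nan), ("Kim", 5.271308, 1.831475),
    ("Joe", 7.770062, 2.094850), ("Sara", 9.837612, 1.956486),
    ("Jeff", 4.637518, 2.516896), ("Bob", 9.004425, 1.592312),
]

# Group the usable observations by id, skipping NA pairs (na.rm = TRUE).
groups = defaultdict(list)
for id_, x, se in rows:
    if not (math.isnan(x) or math.isnan(se)):
        groups[id_].append((x, se))

# One (newMean, newSE) pair per individual, as in the ddply() call.
result = {}
for id_, obs in groups.items():
    wsum = sum(1 / se for _, se in obs)
    result[id_] = (sum(x / se for x, se in obs) / wsum, 1 / wsum)

for id_ in sorted(result):
    mean, se = result[id_]
    print(id_, round(mean, 4), round(se, 5))
```

Individuals with a single usable observation (such as Sam) simply keep their original value and SE, which matches the output table above.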
A quick benchmark:
library(rbenchmark)
# f1 = plyr, f2 = data.table
benchmark(f1(), f2(),
          replications = 1000,
          order = "elapsed",
          columns = c("test", "elapsed", "relative"))

  test elapsed relative
2 f2()   3.580   1.0000
1 f1()   6.398   1.7872
So data.table() is ~1.8x faster for this dataset on my simple laptop.