Produce a precision weighted average among rows with repeated observations


Problem description



I have a dataframe similar to the one generated below. Some individuals have more than one observation for a particular variable, and each variable has an associated standard error (SE) for the estimate. I would like to create a new dataframe that contains only a single row for each individual. For individuals with more than one observation, such as Kim or Bob, I need to calculate a precision-weighted average based on the standard errors of the estimates, along with a variance for the newly calculated weighted mean. For example, for Bob's var1, this means that I would want his var1 value in the new dataframe to be:

weighted.mean(c(example$var1[2], example$var1[10]), 
   c(1/example$SE1[2], 1/example$SE1[10]))

and his new SE1, which would be the variance of the weighted mean, to be:

1/sum(1/example$SE1[2] + 1/example$SE1[10])
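(For Bob's values in the example data below, these two formulas give 8.8982 and 1/(1/2.174308 + 1/1.592312) = 0.91917 respectively, matching the newMean and newSE for Bob in the solution's output.)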

I have tried using the aggregate function, and I am able to calculate the arithmetic mean of the values, but the simple call I wrote neither uses the standard errors nor handles the NAs.

aggregate(example[,1:4], by = list(example[,5]), mean)

Would appreciate any help in developing some code to work through this problem. Here is the example dataset.

set.seed(1562)
example <- data.frame(var1 = rnorm(10, 8, 2))
example$SE1 <- rnorm(10, 2, 1)
example$var2 <- rnorm(10, 8, 2)
example$SE2 <- rnorm(10, 2, 1)
example$id <-
  c("Kim", "Bob", "Joe", "Sam", "Kim", "Kim", "Joe", "Sara", "Jeff", "Bob")
example$SE1[5] <- NA
example$var1[5] <- NA
example$SE2[10] <- NA
example$var2[10] <- NA
example

       var1      SE1      var2        SE2   id
1   9.777769 2.451406  6.363250  2.2739566  Kim
2   8.753078 2.174308  6.219770  1.4978380  Bob
3   7.977356 2.107739  6.835998  2.1647437  Joe
4  11.113048 2.713242 11.091650  1.7018666  Sam
5         NA       NA 11.769884 -0.1310218  Kim
6   5.271308 1.831475  6.818854  3.0294338  Kim
7   7.770062 2.094850  6.387607  0.2272348  Joe
8   9.837612 1.956486  8.517445  3.5126378 Sara
9   4.637518 2.516896  7.173460  2.0292454 Jeff
10  9.004425 1.592312        NA         NA  Bob

Solution

I like the plyr package for these sorts of problems. It should be functionally equivalent to aggregate, but I think it is nice and convenient to use. There are lots of examples and a great ~20-page intro to plyr on the website. For this problem, since the data starts as a data.frame and you want another data.frame on the other end, we use ddply():

library(plyr)
# f1() in the benchmark below wraps this call
ddply(example, "id", summarize, 
      newMean = weighted.mean(x=var1, 1/SE1, na.rm = TRUE),
      newSE = 1/sum(1/SE1, na.rm = TRUE)
      )

Which returns:

    id newMean   newSE
1  Bob  8.8982 0.91917
2 Jeff  4.6375 2.51690
3  Joe  7.8734 1.05064
4  Kim  7.1984 1.04829
5  Sam 11.1130 2.71324
6 Sara  9.8376 1.95649

Also check out ?summarize and ?transform for some other good background. You can also pass an anonymous function to the plyr functions if necessary for more complicated tasks.
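For instance, a minimal sketch of such an anonymous function (not from the original answer; the newMean1/newSE1 column names are illustrative) that handles both variable/SE pairs in one pass:

ddply(example, "id", function(d) {
  # one row per id: precision-weighted mean and its variance for each pair
  data.frame(newMean1 = weighted.mean(d$var1, 1/d$SE1, na.rm = TRUE),
             newSE1   = 1/sum(1/d$SE1, na.rm = TRUE),
             newMean2 = weighted.mean(d$var2, 1/d$SE2, na.rm = TRUE),
             newSE2   = 1/sum(1/d$SE2, na.rm = TRUE))
})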

Or use the data.table package, which can prove faster for some tasks:

library(data.table)
dt <- data.table(example, key="id")
# f2() in the benchmark below wraps this call
dt[, list(newMean = weighted.mean(var1, 1/SE1, na.rm = TRUE),
          newSE = 1/sum(1/SE1, na.rm = TRUE)),
   by = "id"]

A quick benchmark:

library(rbenchmark)
#f1 = plyr, #f2 = data.table
benchmark(f1(), f2(), 
          replications = 1000,
          order = "elapsed",
          columns = c("test", "elapsed", "relative"))

  test elapsed relative
2 f2()   3.580   1.0000
1 f1()   6.398   1.7872

So data.table is ~1.8x faster for this dataset on my simple laptop.
