从频率表有效计算平均值和标准偏差 [英] Efficiently compute mean and standard deviation from a frequency table
问题描述
假设我有以下频率表.
> print(dat)
V1 V2
1 1 11613
2 2 6517
3 3 2442
4 4 687
5 5 159
6 6 29
# V1 = Score
# V2 = Frequency
如何有效地计算均值和标准差? 屈服:SD = 0.87 MEAN = 1.66. 按频率复制分数需要很长时间才能计算出来.
How can I efficiently compute the Mean and standard deviation? Yielding: SD=0.87 MEAN=1.66. Replicating the score by frequency takes too long to compute.
推荐答案
平均值很简单. SD有点棘手(不能再次使用fastmean(),因为分母中有n-1.
Mean is easy. SD is a little trickier (can't just use fastmean() again because there's an n-1 in the denominator.
> dat <- data.frame(freq=seq(6),value=runif(6)*100)
> fastmean <- function(dat) {
+ with(dat, sum(freq*value)/sum(freq) )
+ }
> fastmean(dat)
[1] 55.78302
>
> fastRMSE <- function(dat) {
+ mu <- fastmean(dat)
+ with(dat, sqrt(sum(freq*(value-mu)^2)/(sum(freq)-1) ) )
+ }
> fastRMSE(dat)
[1] 34.9316
>
> # To test
> expanded <- with(dat, rep(value,freq) )
> mean(expanded)
[1] 55.78302
> sd(expanded)
[1] 34.9316
请注意,fastRMSE
计算两次sum(freq)
.消除这种情况可能会导致速度稍有提高.
Note that fastRMSE
calculates sum(freq)
twice. Eliminating this would probably result in another minor speed boost.
基准化
> microbenchmark(
+ fastmean(dat),
+ mean( with(dat, rep(value,freq) ) )
+ )
Unit: microseconds
expr min lq median uq max
1 fastmean(dat) 12.433 13.5335 14.776 15.398 23.921
2 mean(with(dat, rep(value, freq))) 21.225 22.3990 22.714 23.406 86.434
> dat <- data.frame(freq=seq(60),value=runif(60)*100)
>
> dat <- data.frame(freq=seq(60),value=runif(60)*100)
> microbenchmark(
+ fastmean(dat),
+ mean( with(dat, rep(value,freq) ) )
+ )
Unit: microseconds
expr min lq median uq max
1 fastmean(dat) 13.177 14.544 15.8860 17.2905 54.983
2 mean(with(dat, rep(value, freq))) 42.610 48.659 49.8615 50.6385 151.053
> dat <- data.frame(freq=seq(600),value=runif(600)*100)
> microbenchmark(
+ fastmean(dat),
+ mean( with(dat, rep(value,freq) ) )
+ )
Unit: microseconds
expr min lq median uq max
1 fastmean(dat) 15.706 17.489 25.8825 29.615 79.113
2 mean(with(dat, rep(value, freq))) 1827.146 2283.551 2534.7210 2884.933 26196.923
复制解决方案的条目数似乎为O(N ^ 2).
The replicating solution appears to be O( N^2 ) in the number of entries.
fastmean
解决方案似乎具有12ms左右的固定成本,之后可以很好地扩展.
The fastmean
solution appears to have a 12ms or so fixed cost after which it scales beautifully.
更多基准测试
Comparison with dot product.
dat <- data.frame(freq=seq(600),value=runif(600)*100)
dbaupp <- function(dat) {
total.count <- sum(dat$freq)
as.vector(dat$freq %*% dat$value) / total.count
}
microbenchmark(
fastmean(dat),
mean( with(dat, rep(value,freq) ) ),
dbaupp(dat)
)
Unit: microseconds
expr min lq median uq max
1 dbaupp(dat) 20.162 21.6875 25.6010 31.3475 104.054
2 fastmean(dat) 14.680 16.7885 20.7490 25.1765 94.423
3 mean(with(dat, rep(value, freq))) 489.434 503.6310 514.3525 583.2790 30130.302
这篇关于从频率表有效计算平均值和标准偏差的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!