计算平均置信区间而不存储所有数据点 [英] Computing a mean confidence interval without storing all the data points

查看:24
本文介绍了计算平均置信区间而不存储所有数据点的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

对于较大的 n(请参阅下文,了解如何确定足够大的内容),根据中心极限定理,将样本均值的分布视为正态分布(高斯分布)是安全的,但我'd 喜欢为任何 n 提供置信区间的过程.这样做的方法是使用具有 n-1 个自由度的 Student T 分布.

所以问题是,给定您一次收集或遇到的数据点流,您如何计算 c(例如,c=.95) 数据点均值的置信区间(不存储之前遇到的所有数据)?

另一种提问方式是:如何在不存储整个流的情况下跟踪数据流的第一时刻和第二时刻?

额外问题:您能否在不存储整个流的情况下跟踪更高的时刻?

解决方案

[非常感谢 John D Cook 在整理这个答案时学到的很多东西!]

首先,不使用平方和的原因如下:http://www.johndcook.com/blog/2008/09/26/

你应该做什么:

跟踪计数 (n)、均值 (u) 和数量 (s),从中可以确定样本方差和标准误差.(改编自 http://www.johndcook.com/standard_deviation.html.)>

初始化n = u = s = 0.

对于每个新数据点,x:

u0 = u;n++;u += (x - u)/n;s += (x - u0) * (x - u);

则样本方差为s/(n-1),样本均值的方差为s/(n-1)/n,标准样本均值的误差为SE = sqrt(s/(n-1)/n).

计算Student-t c-置信区间(c in (0,1)):

u [加减] SE*g((1-c)/2, n-1)

其中 g 是学生 t 分布的逆 cdf(又名分位数),均值为 0,方差为 1,取概率和自由度(比数据点数少一个)):

g(p,df) = sign(2*p-1)*sqrt(df)*sqrt(1/irib(1, -abs(2*p-1), df/2, 1)/2) - 1)

其中 irib 是逆正则化不完全 beta 函数:

irib(s0,s1,a,b) = z 使得 rib(s0,z,a,b) = s1

其中 rib 是正则化的不完全 beta 函数:

rib(x0,x1,a,b) = B(x0,x1,a,b)/B(a,b)

其中 B(a,b) 是欧拉 beta 函数,B(x0,x1,a,b) 是不完全 beta 函数:

B(a,b) = Gamma(a)*Gamma(b)/Gamma(a+b) =integral_0^1 t^(a-1)*(1-t)^(b-1) DTB(x0,x1,a,b) = 积分_x0^x1 t^(a-1)*(1-t)^(b-1) dt

典型的数值/统计库将具有 beta 函数的实现(或直接使用 Student-t 分布的逆 cdf).对于 C,事实上的标准是 Gnu Scientific Library (GSL).通常会给出 beta 函数的 3 参数版本;对 4 个参数的概括如下:

B(x0,x1,a,b) = B(x1,a,b) - B(x0,a,b)肋骨(x0,x1,a,b) = 肋骨(x1,a,b) - 肋骨(x0,a,b)

最后,这是在 Mathematica 中的一个实现:

(* 取当前的 {n,u,s} 和新的数据点;返回新的 {n,u,s}.*)更新[{n_,u_,s_}, x_] := {n+1, u+(xu)/(n+1), s+(xu)(x-(u+(xu)/(n+1)))}需要[假设测试`"];g[p_, df_] := InverseCDF[StudentTDistribution[df], p](* 给定 n,u,s 和置信水平 c 的平均 CI.*)mci[n_,u_,s_, c_:.95] := With[{d = Sqrt[s/(n-1)/n]*g[(1-c)/2, n-1]},{u+d, u-d}]

比较

StudentTCI[u, SE, n-1, ConfidenceLevel->c]

或者,当整个数据点列表可用时,

MeanCI[list, ConfidenceLevel->c]

最后,如果你不想为 beta 函数之类的东西加载数学库,你可以为 -g((1-c)/2, n-1) 硬编码一个查找表.这里是 c=.95n=2..100:

12.706204736174698, 4.302652729749464, 3.182446305283708, 2.7764451051977934,2.570581835636314, 2.4469118511449666, 2.3646242515927853, 2.306004135204168,2.262157162798205, 2.2281388519862735, 2.2009851600916384, 2.178812829667226,2.1603686564627917, 2.1447866879178012, 2.1314495455559774, 2.1199052992212533,2.1098155778333156, 2.100922040241039, 2.093024054408307, 2.0859634472658626,2.0796138447276835, 2.073873067904019, 2.0686576104190477, 2.0638985616280254,2.0595385527532963, 2.05552943864287, 2.051830516480281, 2.048407141795243,2.0452296421327034, 2.042272456301236, 2.039513446396408, 2.0369333434600976,2.0345152974493392, 2.032244509317719, 2.030107928250338, 2.0280940009804462,2.0261924630291066, 2.024394163911966, 2.022690920036762, 2.0210753903062715,2.0195409704413745, 2.018081702818439, 2.016692199227822, 2.0153675744437627,2.0141033888808457, 2.0128955989194246, 2.011740513729764, 2.0106347576242314,2.0095752371292335, 2.0085591121007527, 2.007583770315835, 2.0066468050616857,2.005745995317864, 2.0048792881880577, 2.004044783289136, 2.0032407188478696,2.002465459291016, 2.001717484145232, 2.000995378088259, 2.0002978220142578,1.9996235849949402, 1.998971517033376, 1.9983405425207483, 1.997729654317692,1.9971379083920013, 1.9965644189523084, 1.996008354025304, 1.9954689314298386,1.994945415107228, 1.9944371117711894, 1.9939433678456229, 1.9934635666661884,1.9929971258898527, 1.9925434951809258, 1.992102154002232, 1.9916726096446793,1.9912543953883763, 1.9908470688116922, 1.9904502102301198, 1.990063421254452,1.989686323456895, 1.9893185571365664, 1.9889597801751728, 1.9886096669757192,1.9882679074772156, 1.9879342062390228, 1.9876082815890748, 1.9872898648311672,1.9869786995062702, 1.986674540703777, 1.986377154418625, 1.9860863169510985,1.9858018143458114, 1.9855234418666061, 1.9852510035054973, 1.9849843115224508,1.9847231860139618、1.98446745450849、1.9842169515863888

渐近逼近 c=.95 的正态 (0,1) 分布的逆 CDF,即:

-sqrt(2)*InverseErf(-c) = 1.9599639845400542355245944305205​​51527955550...

参见 http://mathworld.wolfram.com/InverseErf.html 了解逆erf() 函数.请注意,g((1-.95)/2,n-1) 在至少有 474 个数据点之前不会四舍五入到 1.96.当有 29 个数据点时,它舍入为 2.0.

根据经验,您应该使用 Student-t 而不是 n 的正常近似值,最多至少为 300,而不是根据传统知识为 30.参见http://www.johndcook.com/blog/2008/11/12/.

另请参阅康奈尔大学的 Ping Li 撰写的改进压缩计数".

For large n (see below for how to determine what's large enough), it's safe to treat, by the central limit theorem, the distribution of the sample mean as normal (gaussian) but I'd like a procedure that gives a confidence interval for any n. The way to do that is to use a Student T distribution with n-1 degrees of freedom.

So the question is, given a stream of data points that you collect or encounter one at a time, how do you compute a c (eg, c=.95) confidence interval on the mean of the data points (without storing all of the previously encountered data)?

Another way to ask this is: How do you keep track of the first and second moments for a stream of data without storing the whole stream?

BONUS QUESTION: Can you keep track of higher moments without storing the whole stream?

解决方案

[Huge thanks to John D Cook for much of what I learned in putting together this answer!]

First, here's the reason not to use sum-of-squares: http://www.johndcook.com/blog/2008/09/26/

What you should do instead:

Keep track of the count (n), the mean (u), and a quantity (s) from which sample variance and standard error can be determined. (Adapted from http://www.johndcook.com/standard_deviation.html.)

Initialize n = u = s = 0.

For each new datapoint, x:

u0 = u;
n ++;
u += (x - u) / n;
s += (x - u0) * (x - u);

The sample variance is then s/(n-1), the variance of the sample mean is s/(n-1)/n, and the standard error of the sample mean is SE = sqrt(s/(n-1)/n).

It remains to compute the Student-t c-confidence interval (c in (0,1)):

u [plus or minus] SE*g((1-c)/2, n-1)

where g is the inverse cdf (aka quantile) of the Student-t distribution with mean 0 and variance 1, taking a probability and the degrees of freedom (one less than the number of data points):

g(p,df) = sign(2*p-1)*sqrt(df)*sqrt(1/irib(1, -abs(2*p-1), df/2, 1/2) - 1)

where irib is the inverse regularized incomplete beta function:

irib(s0,s1,a,b) = z such that rib(s0,z,a,b) = s1

where rib is the regularized incomplete beta function:

rib(x0,x1,a,b) = B(x0,x1,a,b) / B(a,b)

where B(a,b) is the Euler beta function and B(x0,x1,a,b) is the incomplete beta function:

B(a,b) = Gamma(a)*Gamma(b)/Gamma(a+b) = integral_0^1 t^(a-1)*(1-t)^(b-1) dt
B(x0,x1,a,b) = integral_x0^x1 t^(a-1)*(1-t)^(b-1) dt

Typical numerical/statistics libraries will have implementations of the beta function (or the inverse cdf of the Student-t distribution directly). For C, the de facto standard is the Gnu Scientific Library (GSL). Often a 3-argument version of the beta function is given; the generalization to 4 arguments is as follows:

B(x0,x1,a,b) = B(x1,a,b) - B(x0,a,b)
rib(x0,x1,a,b) = rib(x1,a,b) - rib(x0,a,b)

Finally, here is an implementation in Mathematica:

(* Take current {n,u,s} and new data point; return new {n,u,s}. *)
update[{n_,u_,s_}, x_] := {n+1, u+(x-u)/(n+1), s+(x-u)(x-(u+(x-u)/(n+1)))}

Needs["HypothesisTesting`"];
g[p_, df_] := InverseCDF[StudentTDistribution[df], p]

(* Mean CI given n,u,s and confidence level c. *)
mci[n_,u_,s_, c_:.95] := With[{d = Sqrt[s/(n-1)/n]*g[(1-c)/2, n-1]}, 
  {u+d, u-d}]

Compare to

StudentTCI[u, SE, n-1, ConfidenceLevel->c]

or, when the entire list of data points is available,

MeanCI[list, ConfidenceLevel->c]

Finally, if you don't want to load math libraries for things like the beta function, you can hardcode a lookup table for -g((1-c)/2, n-1). Here it is for c=.95 and n=2..100:

12.706204736174698, 4.302652729749464, 3.182446305283708, 2.7764451051977934, 2.570581835636314, 2.4469118511449666, 2.3646242515927853, 2.306004135204168, 2.262157162798205, 2.2281388519862735, 2.2009851600916384, 2.178812829667226, 2.1603686564627917, 2.1447866879178012, 2.131449545559774, 2.1199052992212533, 2.1098155778333156, 2.100922040241039, 2.093024054408307, 2.0859634472658626, 2.0796138447276835, 2.073873067904019, 2.0686576104190477, 2.0638985616280254, 2.0595385527532963, 2.05552943864287, 2.051830516480281, 2.048407141795243, 2.0452296421327034, 2.042272456301236, 2.039513446396408, 2.0369333434600976, 2.0345152974493392, 2.032244509317719, 2.030107928250338, 2.0280940009804462, 2.0261924630291066, 2.024394163911966, 2.022690920036762, 2.0210753903062715, 2.0195409704413745, 2.018081702818439, 2.016692199227822, 2.0153675744437627, 2.0141033888808457, 2.0128955989194246, 2.011740513729764, 2.0106347576242314, 2.0095752371292335, 2.0085591121007527, 2.007583770315835, 2.0066468050616857, 2.005745995317864, 2.0048792881880577, 2.004044783289136, 2.0032407188478696, 2.002465459291016, 2.001717484145232, 2.000995378088259, 2.0002978220142578, 1.9996235849949402, 1.998971517033376, 1.9983405425207483, 1.997729654317692, 1.9971379083920013, 1.9965644189523084, 1.996008354025304, 1.9954689314298386, 1.994945415107228, 1.9944371117711894, 1.9939433678456229, 1.993463566661884, 1.9929971258898527, 1.9925434951809258, 1.992102154002232, 1.9916726096446793, 1.9912543953883763, 1.9908470688116922, 1.9904502102301198, 1.990063421254452, 1.989686323456895, 1.9893185571365664, 1.9889597801751728, 1.9886096669757192, 1.9882679074772156, 1.9879342062390228, 1.9876082815890748, 1.9872898648311672, 1.9869786995062702, 1.986674540703777, 1.986377154418625, 1.9860863169510985, 1.9858018143458114, 1.9855234418666061, 1.9852510035054973, 1.9849843115224508, 1.9847231860139618, 1.98446745450849, 1.9842169515863888

which is asymptotically approaching the inverse CDF of a normal(0,1) distribution for c=.95, which is:

-sqrt(2)*InverseErf(-c) = 1.959963984540054235524594430520551527955550...

See http://mathworld.wolfram.com/InverseErf.html for the inverse erf() function. Notice that g((1-.95)/2,n-1) doesn't round to 1.96 until there are at least 474 data points. It rounds to 2.0 when there are 29 data points.

As a rule of thumb, you should use Student-t instead of the normal approximation for n up to at least 300, not 30 per conventional wisdom. Cf. http://www.johndcook.com/blog/2008/11/12/.

See also "Improving Compressed Counting" by Ping Li of Cornell.

这篇关于计算平均置信区间而不存储所有数据点的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆