Is seaborn confidence interval computed correctly?


Question



First, I must admit that my statistics knowledge is rusty at best: even when it was shiny and new, it wasn't a discipline I particularly liked, which means I had a hard time making sense of it.

Nevertheless, I took a look at how the barplot graphs were calculating error bars, and was surprised to find a "confidence interval" (CI) used instead of the (more common) standard deviation. Researching CIs further led me to this Wikipedia article, which seems to say that, basically, a CI is computed as:

    CI = mean ± 1.96 · σ/√n

Or, in pseudocode:

import numpy as np

def ci_wp(a):
    """calculate confidence interval using Wikipedia's formula"""
    m = np.mean(a)
    s = 1.96 * np.std(a) / np.sqrt(len(a))
    return m - s, m + s

But what we find in seaborn/utils.py is:

def ci(a, which=95, axis=None):
    """Return a percentile range from an array of values."""
    p = 50 - which / 2, 50 + which / 2
    return percentiles(a, p, axis)
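As a sanity check, the same logic can be reproduced with plain NumPy (a sketch: `np.percentile` stands in for seaborn's internal `percentiles` helper). It makes clear that this function returns percentiles of the raw data, not a confidence interval of the mean:

```python
import numpy as np

def ci_percentile(a, which=95, axis=None):
    """Same logic as seaborn's utils.ci, written with np.percentile."""
    p = 50 - which / 2, 50 + which / 2   # e.g. (2.5, 97.5) for which=95
    return np.percentile(a, p, axis)

print(ci_percentile(np.arange(100)))  # the data's own 2.5th/97.5th percentiles
```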

Now maybe I'm missing this completely, but this seems just like a completely different calculation than the one proposed by Wikipedia. Can anyone explain this discrepancy?

To give another example, from comments, why do we get so different results between:

 >>> sb.utils.ci(np.arange(100))
 array([ 2.475, 96.525])

 >>> ci_wp(np.arange(100))
 (43.842250270646467, 55.157749729353533)

And to compare with other statistical tools:

 import numpy as np
 import scipy.stats
 import scipy as sp  # so that sp.stats.sem is available

 def ci_std(a):
     """calculate margin of error using standard deviation"""
     m = np.mean(a)
     s = np.std(a)
     return m - s, m + s

 def ci_sem(a):
     """calculate margin of error using standard error of the mean"""
     m = np.mean(a)
     s = sp.stats.sem(a)
     return m - s, m + s

Which gives us:

>>> ci_sem(np.arange(100))
(46.598850802411796, 52.401149197588204)

>>> ci_std(np.arange(100))
(20.633929952277882, 78.366070047722118)
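Incidentally, the ci_wp and ci_sem intervals above differ only by a constant factor: both margins are multiples of the standard error, ci_wp using 1.96 standard errors (with np.std's default ddof=0) and ci_sem using exactly one (scipy.stats.sem defaults to ddof=1). A quick check of that relationship, using only NumPy (np.std(a, ddof=1)/sqrt(n) equals scipy.stats.sem(a)):

```python
import numpy as np

a = np.arange(100)
n = len(a)
margin_wp = 1.96 * np.std(a) / np.sqrt(n)     # ci_wp's half-width (ddof=0)
margin_sem = np.std(a, ddof=1) / np.sqrt(n)   # ci_sem's half-width (ddof=1)
# The two margins differ by a factor of 1.96 * sqrt((n - 1) / n):
print(margin_wp / margin_sem)
```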

Or with a random sample:

rng = np.random.RandomState(10)
a = rng.normal(size=100)
print(sb.utils.ci(a))
print(ci_wp(a))
print(ci_sem(a))
print(ci_std(a))

... which yields:

[-1.9667006   2.19502303]
(-0.1101230745774124, 0.26895640045116026)
(-0.017774461397903049, 0.17660778727165088)
(-0.88762281417683186, 1.0464561400505796)

Why are Seaborn's numbers so radically different from the other results?

Solution

Your calculation using this Wikipedia formula is completely right. Seaborn simply uses another method: bootstrapping (https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). It is well described by Dragicevic [1]:

[It] consists of generating many alternative datasets from the experimental data by randomly drawing observations with replacement. The variability across these datasets is assumed to approximate sampling error and is used to compute so-called bootstrap confidence intervals. [...] It is very versatile and works for many kinds of distributions.
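The idea in the quoted passage can be sketched in a few lines of NumPy (a simplified stand-in for illustration, not seaborn's actual implementation): resample with replacement, collect the statistic of each resample, and take the percentile range of those values.

```python
import numpy as np

def bootstrap_ci(a, which=95, n_boot=10000, seed=0):
    """Percentile-bootstrap confidence interval of the mean (a sketch)."""
    rng = np.random.RandomState(seed)
    a = np.asarray(a)
    # Resample the data with replacement many times; keep each resample's mean.
    boot_means = np.array([rng.choice(a, size=len(a), replace=True).mean()
                           for _ in range(n_boot)])
    # The CI is the percentile range of those bootstrap means.
    p = 50 - which / 2, 50 + which / 2
    return np.percentile(boot_means, p)
```

For np.arange(100) this lands near the (43.8, 55.2) interval from ci_wp, which is exactly the agreement demonstrated below.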

In Seaborn's source code, barplot uses estimate_statistic, which bootstraps the data and then computes the confidence interval on it:

>>> sb.utils.ci(sb.algorithms.bootstrap(np.arange(100)))
array([43.91, 55.21025])

The result is consistent with your calculation.

[1] Dragicevic, P. (2016). Fair statistical communication in HCI. In Modern Statistical Methods for HCI (pp. 291-330). Springer, Cham.
