How are the "error bands" in Seaborn tsplot calculated?


Question

I'm trying to understand how the error bands are calculated in the tsplot. Examples of the error bands are shown here.

When I plot something simple like

sns.tsplot(np.array([[0,1,0,1,0,1,0,1], [1,0,1,0,1,0,1,0], [.5,.5,.5,.5,.5,.5,.5,.5]]))

I get a horizontal line at y=0.5 as expected. The top error band is also a horizontal line at around y=0.665 and the bottom error band is a horizontal line at around y=0.335. Can someone explain how these are derived?

Recommended answer

The question and this answer refer to old versions of Seaborn and are not relevant for new versions. See @CGFoX's comment below.

I'm not a statistician, but I read through the seaborn code in order to see exactly what's happening. There are three steps:

  1. Bootstrap resampling. Seaborn creates resampled versions of your data. Each of these is a 3x8 matrix like yours, but each row is randomly selected, with replacement, from the three rows of your input. For example, one might be:

[[ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
 [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
 [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]]

Another might be:

[[ 1.   0.   1.   0.   1.   0.   1.   0. ]
 [ 0.5  0.5  0.5  0.5  0.5  0.5  0.5  0.5]
 [ 0.   1.   0.   1.   0.   1.   0.   1. ]]

It creates n_boot of these (10000 by default).

  2. Central tendency estimation. Seaborn runs a function on each column of each of the 10000 resampled versions of your data. Because you didn't specify this argument (estimator), it feeds the columns to a mean function (numpy.mean with axis=0). Many of the columns in your bootstrap iterations will have a mean of 0.5, because they will be things like [0, 0.5, 1], [0.5, 1, 0], [0.5, 0.5, 0.5], etc., but you will also have some [1,1,0] and even some [1,1,1], which result in higher means.

  3. Confidence interval determination. For each column, seaborn sorts the 10000 estimates of the mean (one from each resampled version of the data) from smallest to greatest, and picks the ones that mark the lower and upper ends of the CI. By default it uses a 68% CI, which corresponds to the 16th and 84th percentiles: if you line up all 10000 mean estimates, it picks roughly the 1600th and the 8400th (8400 - 1600 = 6800, or 68% of 10000).
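The three steps can be sketched in plain NumPy. This is only an illustration of the procedure as described, not seaborn's actual code: the function name bootstrap_ci, its parameters, and the use of numpy.percentile for the last step are assumptions made for this example.

```python
import numpy as np

# The 3x8 input from the question.
data = np.array([
    [0, 1, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [.5, .5, .5, .5, .5, .5, .5, .5],
])


def bootstrap_ci(data, n_boot=10000, ci=68, estimator=np.mean, seed=0):
    """Hypothetical helper: percentile bootstrap CI for each column."""
    rng = np.random.default_rng(seed)
    n_rows = data.shape[0]

    # Step 1: draw row indices with replacement, one set per iteration.
    idx = rng.integers(0, n_rows, size=(n_boot, n_rows))
    resampled = data[idx]  # shape: (n_boot, n_rows, n_cols)

    # Step 2: apply the estimator across the resampled rows of each column.
    estimates = estimator(resampled, axis=1)  # shape: (n_boot, n_cols)

    # Step 3: take the percentiles that bound the central `ci` percent.
    low, high = np.percentile(estimates, [50 - ci / 2, 50 + ci / 2], axis=0)
    return low, high


low, high = bootstrap_ci(data)
print(low)
print(high)
```

With this input, the lower and upper bounds should land at about 1/3 and 2/3 in every column, close to the 0.335 and 0.665 seen in the plot.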

A few notes:

  • There are actually only 3^3, or 27, possible resampled versions of your array, and if you use a function such as mean, where the order of the rows doesn't matter, there are only 10 distinct ones (the number of ways to choose 3 rows out of 3 with replacement when order is ignored). So all 10000 bootstrap iterations will be identical to one of those 27 versions, or 10 versions in the unordered case. This means that it's probably silly to do 10000 iterations in this case.

  • The means 0.3333... and 0.6666... that show up as your confidence intervals are the means for [1,0,0] and [1,1,0], respectively, or rearranged versions of those.
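Because there are so few possible resamples, the bootstrap distribution for one column can be enumerated exactly instead of simulated. The sketch below does this for the first column of the input (values 0, 1 and 0.5), treating the 27 ordered resamples as equally likely; the variable names are made up for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# First column of the input: one value from each of the three rows.
column = [Fraction(0), Fraction(1), Fraction(1, 2)]

# Mean of every ordered resample of three rows drawn with replacement.
means = Counter(sum(draw) / 3 for draw in product(column, repeat=3))

# Print each possible mean, how many of the 27 resamples produce it,
# and the cumulative share of resamples at or below that mean.
cumulative = Fraction(0)
for value in sorted(means):
    cumulative += Fraction(means[value], 27)
    print(value, means[value], round(float(cumulative), 3))
```

Only 4 of the 27 resamples have a mean below 1/3 (about 14.8%), and 10 of 27 (about 37%) have a mean of at most 1/3, so the 16th percentile falls on 1/3. By symmetry the 84th percentile falls on 2/3.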
