Confusion with bandwidth on seaborn's kdeplot


Question

lineslist, below, represents a set of lines (for some chemical spectrum, let's say), in MHz. I know the linewidth of the laser used to probe these lines to be 5 MHz. So, naively, the kernel density estimate of these lines with a bandwidth of 5 should give me the continuous distribution that would be produced in an experiment using the aforementioned laser.

The following code:

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist=np.array([-153.3048645 ,  -75.71982528,  -12.1897835 ,  -73.94903264,
   -178.14293936, -123.51339541, -118.11826988,  -50.19812838,
    -43.69282206,  -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r",bw=5)
plt.show()

yields a plot that looks like a Gaussian with a bandwidth much larger than 5 MHz.

I'm guessing that, for some reason, the bandwidth of the kdeplot has different units than the plot itself. The separation between the highest and lowest line is ~170.0 MHz. Supposing that I need to rescale the bandwidth by this factor:

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
lineslist=np.array([-153.3048645 ,  -75.71982528,  -12.1897835 ,  -73.94903264,
   -178.14293936, -123.51339541, -118.11826988,  -50.19812838,
    -43.69282206,  -34.21268228])
sns.kdeplot(lineslist, shade=True, color="r",bw=5/(np.max(lineslist)-np.min(lineslist)))
plt.show()

I get a plot with lines that seem to have the expected 5 MHz bandwidth.

As dandy as that solution is, I've pulled it from my arse, and I'm curious whether someone more familiar with seaborn's kdeplot internals can comment on why this works.

Thanks,

Samuel

Recommended answer

One thing to note is that Seaborn doesn't actually handle the bandwidth itself - it passes the setting on more-or-less as-is to either the SciPy or the Statsmodels package, depending on what you have installed. (It prefers Statsmodels, but will fall back to SciPy.)

The documentation for this parameter in the various sub-packages is a little confusing, but from what I can tell, the key issue here is that the setting for SciPy is a bandwidth factor, rather than a bandwidth itself. That is, this factor is (effectively) multiplied by the standard deviation of the data you're plotting to give you the actual bandwidth used in the kernels.

So with SciPy, if you have a fixed number which you want to use as your bandwidth, you need to divide it by the standard deviation of your data. And if you're trying to plot multiple datasets consistently, you need to adjust for the standard deviation of each dataset. This adjustment is effectively what you did by scaling by the range, but again, the number used is not the range of the data but its standard deviation.
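As a rough check of the factor-versus-bandwidth distinction, the sketch below calls scipy.stats.gaussian_kde directly on the data from the question. It assumes a SciPy version in which a scalar bw_method is used as the factor and the kernel covariance is the sample covariance (ddof=1) times the factor squared. The variable names (target_bw, naive_bw and so on) are made up for illustration.

```python
import numpy as np
from scipy import stats

lineslist = np.array([-153.3048645, -75.71982528, -12.1897835, -73.94903264,
                      -178.14293936, -123.51339541, -118.11826988, -50.19812838,
                      -43.69282206, -34.21268228])

target_bw = 5.0  # desired kernel width in MHz (hypothetical name)
sample_std = np.std(lineslist, ddof=1)  # gaussian_kde uses the ddof=1 covariance

def effective_bandwidth(kde):
    """Standard deviation of the Gaussian kernel actually used by a 1-D KDE."""
    return float(np.sqrt(kde.covariance[0, 0]))

# Passing 5 directly: SciPy treats it as a factor, so the kernel is 5 * std wide.
naive_bw = effective_bandwidth(stats.gaussian_kde(lineslist, bw_method=target_bw))

# Dividing by the standard deviation recovers a 5 MHz kernel.
fixed_bw = effective_bandwidth(
    stats.gaussian_kde(lineslist, bw_method=target_bw / sample_std))

# Dividing by the range (the workaround in the question) gives a narrower kernel.
data_range = lineslist.max() - lineslist.min()
range_bw = effective_bandwidth(
    stats.gaussian_kde(lineslist, bw_method=target_bw / data_range))

print("bw_method=5           ->", naive_bw, "MHz")
print("bw_method=5/std       ->", fixed_bw, "MHz")
print("bw_method=5/range     ->", range_bw, "MHz")
```

Under these assumptions the range-scaled version ends up with a kernel narrower than 5 MHz, since the range of this dataset is larger than its standard deviation.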

To make things all the more confusing, Statsmodels expects the true bandwidth when given a scalar value, rather than a factor that's multiplied by the standard deviation of the sample. So depending on what backend you're using, Seaborn will behave differently. There's no direct way to tell Seaborn which backend to use - the best way to test is probably to try importing statsmodels and see whether that succeeds (takes the bandwidth directly) or fails (takes a bandwidth factor).
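A minimal sketch of that import test, assuming Seaborn picks its backend by trying to import statsmodels.nonparametric.api. If Statsmodels is present, the sketch also fits a KDEUnivariate with a scalar bandwidth to show that the value is taken as-is. The names has_statsmodels, backend and sm_bw are illustrative only.

```python
import numpy as np

lineslist = np.array([-153.3048645, -75.71982528, -12.1897835, -73.94903264,
                      -178.14293936, -123.51339541, -118.11826988, -50.19812838,
                      -43.69282206, -34.21268228])

# Try the import that decides which backend is available.
try:
    import statsmodels.nonparametric.api as smnp
    has_statsmodels = True
except ImportError:
    has_statsmodels = False

backend = "statsmodels" if has_statsmodels else "scipy"
print("KDE backend that would be used:", backend)

sm_bw = None  # stays None when Statsmodels is not installed
if has_statsmodels:
    # With Statsmodels, a scalar bw is the bandwidth itself, not a factor.
    kde = smnp.KDEUnivariate(lineslist)
    kde.fit(kernel="gau", bw=5.0, fft=True)
    sm_bw = float(kde.bw)
    print("Statsmodels bandwidth actually used:", sm_bw, "MHz")
else:
    print("Only SciPy available: a scalar bw is a factor of the data's std.")
```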

By the way, these results were tested against Seaborn version 0.7.0 - I expect (hope?) that future versions may change this behavior.
