Seaborn KDEPlot - not enough variation in data?
I have a data frame containing ~900 rows; I'm trying to plot KDEplots for some of the columns. In some columns, a majority of the values are the same, minimum value. When I include too many of the minimum values, the KDEPlot abruptly stops showing the minimums. For example, the following includes 600 values, of which 450 are the minimum, and the plot looks fine:
y = df.sort_values(by='col1', ascending=False)['col1'].values[:600]
sb.kdeplot(y)
But including 451 of the minimum values gives a very different output:
y = df.sort_values(by='col1', ascending=False)['col1'].values[:601]
sb.kdeplot(y)
Eventually I would like to plot bivariate KDEPlots of different columns against each other, but I'd like to understand this first.
The problem is the default algorithm that is chosen for the "bandwidth" of the kde. The default method is 'scott', which isn't very helpful when there are many equal values.
The bandwidth is the width of the Gaussians that are positioned at every sample point and summed up. Lower bandwidths follow the data more closely; higher bandwidths smooth everything out. The sweet spot is somewhere in the middle. In this case bw=0.3 could be a good option. To compare different KDEs, it is recommended to use exactly the same bandwidth each time.
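To make the "sum of Gaussians" picture concrete, here is a minimal NumPy sketch. It is not what seaborn or statsmodels run internally; the function name gaussian_kde_manual and the sample data are made up for illustration, and bandwidth here means the standard deviation of each Gaussian in data units.

```python
import numpy as np

def gaussian_kde_manual(samples, grid, bandwidth):
    """Average one Gaussian bump of width `bandwidth` per sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Shape (len(grid), len(samples)): scaled distance from each grid point to each sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    # Averaging (rather than summing) keeps the total area at 1.
    return bumps.mean(axis=1)

rng = np.random.default_rng(0)
# Illustrative data: 150 normal values plus 450 copies of one fixed value.
samples = np.concatenate([rng.normal(0, 1, 150), np.repeat(-3.0, 450)])
grid = np.linspace(-6, 5, 1101)

narrow = gaussian_kde_manual(samples, grid, 0.05)  # hugs the data, tall spike at -3
wide = gaussian_kde_manual(samples, grid, 0.3)     # smoother, lower spike
```

Both curves integrate to roughly 1; only the amount of smoothing differs, which is why comparing KDEs drawn with different bandwidths can be misleading.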
Here is some sample code to show the difference between bw='scott' and bw=0.3. The example data are 150 values from a standard normal distribution together with either 400, 450 or 500 fixed values.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

sns.set()

fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(10, 5), gridspec_kw={'hspace': 0.3})
# Top row: Scott's rule; bottom row: fixed bandwidth of 0.3.
for i, bw in enumerate(['scott', 0.3]):
    for j, num_same in enumerate([400, 450, 500]):
        # 150 standard-normal values plus num_same copies of the fixed value -3.
        y = np.concatenate([np.random.normal(0, 1, 150), np.repeat(-3, num_same)])
        # Note: seaborn >= 0.11 deprecates `bw` in favor of `bw_method` / `bw_adjust`.
        sns.kdeplot(y, bw=bw, ax=axs[i, j])
        axs[i, j].set_title(f'bw:{bw}; fixed values:{num_same}')
plt.show()
The third plot gives a warning that the KDE cannot be drawn using Scott's suggested bandwidth.
PS: As mentioned by @mwascom in the comments, in this case statsmodels.nonparametric.kde is used (not scipy.stats.gaussian_kde). There the default is "scott": 1.059 * A * nobs ** (-1/5.), where A is min(std(X), IQR/1.34). The min() explains the abrupt change in behavior. IQR is the "interquartile range", the difference between the 75th and 25th percentiles. Once the repeated minimum value also occupies the 75th percentile, both percentiles coincide, so IQR, and with it A and the bandwidth, drops to 0.
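The jump between 450 and 451 minimum values can be reproduced with a small sketch of the formula quoted above. This is not the statsmodels source: the function name scott_bandwidth_quoted is hypothetical, the use of np.percentile with its default linear interpolation and of ddof=1 for the standard deviation are assumptions, and the data are invented (150 varied values plus zeros standing in for the minimum).

```python
import numpy as np

def scott_bandwidth_quoted(x):
    """Bandwidth from the quoted formula: 1.059 * A * n ** (-1/5)."""
    x = np.asarray(x, dtype=float)
    q25, q75 = np.percentile(x, [25, 75])
    iqr = q75 - q25
    # A is the smaller of the standard deviation and the scaled IQR.
    a = min(np.std(x, ddof=1), iqr / 1.34)
    return 1.059 * a * len(x) ** (-1 / 5)

rng = np.random.default_rng(0)
varied = rng.uniform(1.0, 2.0, 150)  # hypothetical non-minimum values

# 450 of 600 values at the minimum: the 75th percentile still lies just above it.
bw_450 = scott_bandwidth_quoted(np.concatenate([varied, np.zeros(450)]))
# 451 of 601 values at the minimum: the 25th and 75th percentiles coincide.
bw_451 = scott_bandwidth_quoted(np.concatenate([varied, np.zeros(451)]))

print(bw_450, bw_451)
```

With one extra minimum value the IQR term becomes zero, min() picks it, and the bandwidth collapses, which matches the sudden change seen in the question.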
Edit: Since Seaborn 0.11, the statsmodels backend has been dropped, so KDEs are only calculated via scipy.stats.gaussian_kde.
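For comparison, here is a short sketch of what scipy.stats.gaussian_kde does with the same kind of data. Its default Scott factor depends only on the number of points and is multiplied by the sample standard deviation, with no IQR term, so many identical values do not drive the bandwidth to zero. The data below are illustrative only. In seaborn 0.11 and later, the corresponding kdeplot parameters are bw_method and bw_adjust.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Illustrative data: 150 normal values plus 500 copies of one fixed value.
y = np.concatenate([rng.normal(0, 1, 150), np.repeat(-3.0, 500)])

kde = gaussian_kde(y)  # default bw_method is Scott's rule: factor = n ** (-1/5) in 1-D

# Kernel standard deviation in data units: Scott factor times the sample std.
bandwidth = kde.factor * np.std(y, ddof=1)

# The density can still be evaluated at the repeated value.
density_at_spike = kde(np.array([-3.0]))[0]
print(kde.factor, bandwidth, density_at_spike)
```

Because a scalar passed as bw_method to gaussian_kde is a multiplier on the standard deviation rather than an absolute width, a value such as 0.3 does not mean the same thing here as it did with the old statsmodels backend.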