Seaborn KDEPlot - not enough variation in data?

Problem description

I have a data frame containing ~900 rows; I'm trying to plot KDEplots for some of the columns. In some columns, a majority of the values are the same, minimum value. When I include too many of the minimum values, the KDEPlot abruptly stops showing the minimums. For example, the following includes 600 values, of which 450 are the minimum, and the plot looks fine:

y = df.sort_values(by='col1', ascending=False)['col1'].values[:600]
sb.kdeplot(y)

But including 451 of the minimum values gives a very different output:

y = df.sort_values(by='col1', ascending=False)['col1'].values[:601]
sb.kdeplot(y)

Eventually I would like to plot bivariate KDEPlots of different columns against each other, but I'd like to understand this first.

Solution

The problem is the default algorithm that is chosen for the "bandwidth" of the kde. The default method is 'scott', which isn't very helpful when there are many equal values.

The bandwidth is the width of the Gaussians that are placed at every sample point and summed up. Lower bandwidths follow the data more closely; higher bandwidths smooth everything out. The sweet spot is somewhere in the middle. In this case bw=0.3 could be a good option. To compare different KDEs, it is recommended to use exactly the same bandwidth for each of them.
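To make the role of the bandwidth concrete, here is a minimal sketch that builds a KDE by hand with NumPy: one Gaussian per sample, averaged over all samples. The function name `gaussian_kde_manual`, the grid, and the two bandwidth values are illustrative assumptions, not something seaborn exposes.

```python
import numpy as np

def gaussian_kde_manual(samples, grid, bandwidth):
    """Average of Gaussians of width `bandwidth`, one centred on each sample.

    Hypothetical helper for illustration; not part of seaborn or scipy.
    """
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardized distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Same kind of data as in the answer: 150 normal draws plus 450 copies of -3
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0, 1, 150), np.repeat(-3.0, 450)])
grid = np.linspace(-8, 8, 1601)

narrow = gaussian_kde_manual(samples, grid, bandwidth=0.1)  # spiky, hugs the data
wide = gaussian_kde_manual(samples, grid, bandwidth=1.0)    # heavily smoothed
```

Both curves integrate to roughly 1; the narrow one has a much taller spike at the repeated value, while the wide one blurs it into the rest of the data.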

Here is some sample code to show the difference between bw='scott' and bw=0.3. The example data are 150 values from a standard normal distribution together with either 400, 450 or 500 fixed values.

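import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set()

# One row per bandwidth setting, one column per number of repeated values
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(10,5), gridspec_kw={'hspace':0.3})

for i, bw in enumerate(['scott', 0.3]):
    for j, num_same in enumerate([400, 450, 500]):
        # 150 standard-normal draws plus num_same copies of the fixed value -3
        y = np.concatenate([np.random.normal(0, 1, 150), np.repeat(-3, num_same)])
        # `bw` is the pre-0.11 seaborn parameter; from 0.11 on it is deprecated in favor of `bw_method`
        sns.kdeplot(y, bw=bw, ax=axs[i, j])
        axs[i, j].set_title(f'bw:{bw}; fixed values:{num_same}')
plt.show()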

The third plot in the top row gives a warning that the KDE cannot be drawn using Scott's suggested bandwidth.

PS: As mentioned by @mwascom in the comments, in this case statsmodels.nonparametric.kde is used (not scipy.stats.gaussian_kde). There the default is "scott": 1.059 * A * nobs ** (-1/5.), where A is min(std(X), IQR/1.34). The min() explains the abrupt change in behavior. IQR is the "interquartile range", the difference between the 75th and 25th percentiles. Once more than 75% of the values are identical, both percentiles coincide, so IQR is 0 and the bandwidth collapses to 0.
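The jump between 450 and 451 minimum values can be reproduced with a few lines of NumPy. This is a sketch of the formula as quoted above, not a copy of the statsmodels source; the helper names `scott_bandwidth` and `make_data` and the synthetic data are assumptions for illustration, and the library's exact percentile convention may differ.

```python
import numpy as np

def scott_bandwidth(x):
    """Bandwidth from the quoted rule: 1.059 * A * nobs ** (-1/5).

    Hypothetical helper; A = min(std(X), IQR / 1.34) as stated in the answer.
    """
    x = np.asarray(x, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    iqr = q75 - q25
    a = min(np.std(x, ddof=1), iqr / 1.34)
    return 1.059 * a * len(x) ** (-1 / 5)

def make_data(n_min):
    """150 distinct values in [1, 2] plus n_min copies of the minimum 0."""
    return np.concatenate([np.linspace(1, 2, 150), np.zeros(n_min)])

bw_450 = scott_bandwidth(make_data(450))  # 450 of 600 values are the minimum
bw_451 = scott_bandwidth(make_data(451))  # 451 of 601 values are the minimum

print(f"450 minimum values: bandwidth = {bw_450:.4f}")
print(f"451 minimum values: bandwidth = {bw_451:.4f}")
```

With 450 of 600 values at the minimum, the 75th percentile still lands just above the minimum, so the IQR is small but positive. With 451 of 601 it lands on the minimum itself, the IQR becomes 0, and the min() selects it.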

Edit: Since Seaborn 0.11, the statsmodels backend has been dropped, so KDEs are only calculated via scipy.stats.gaussian_kde.
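For the scipy-based path, the following sketch calls scipy.stats.gaussian_kde directly on data like the example above. It assumes scipy is installed and shows two things: Scott's rule in scipy depends only on the sample count and standard deviation (no IQR term), and a scalar bw_method is a multiplier on the sample standard deviation rather than an absolute width. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# 150 normal draws plus 500 copies of -3 (the case that failed with the old backend)
y = np.concatenate([rng.normal(0, 1, 150), np.repeat(-3.0, 500)])

kde_scott = gaussian_kde(y)                 # default bw_method is 'scott'
kde_fixed = gaussian_kde(y, bw_method=0.3)  # scalar is used as the scaling factor

# Effective kernel width = factor * sample standard deviation
scott_width = float(np.sqrt(kde_scott.covariance[0, 0]))
fixed_width = float(np.sqrt(kde_fixed.covariance[0, 0]))

print(f"scott factor: {kde_scott.factor:.4f}, kernel width: {scott_width:.4f}")
print(f"fixed factor: {kde_fixed.factor:.4f}, kernel width: {fixed_width:.4f}")
```

Because the scalar is relative to the data's spread, bw_method=0.3 here does not give the same curve as an absolute bandwidth of 0.3 did with the older backend.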
