How does distplot calculate the kde curve?

Problem description

I'm using seaborn for plotting data. Everything was fine until my mentor asked me how the plot is made, for example in the following code.

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

x = np.random.normal(size=100)
sns.distplot(x)
plt.show()

The result of this code is:

My questions:

1- How does distplot manage to plot this?

2- Why does the plot start at -3 and end at 4?

3- Is there any parametric function or any specific mathematical function that distplot uses to plot the data like this?

I use distplot and kde to plot my data, but I would like to know what the maths behind those functions is.

Solution

Here is some code trying to illustrate how the kde curve is drawn.

The code starts with a random sample of 100 xs.

These xs are shown in a histogram. With density=True the histogram is normalized so that its total area is 1. (By default, the bars of the histogram show counts, so they grow with the number of points. With density=True, the total area is calculated internally and each bar's height is divided by that area.)
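The normalization can be checked with a minimal sketch using np.histogram, which plt.hist relies on for binning. The fixed seed and the variable names are only illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
xs = rng.normal(0, 1, 100)

counts, edges = np.histogram(xs, bins=10)             # raw counts per bin
density, _ = np.histogram(xs, bins=10, density=True)  # normalized bar heights
widths = np.diff(edges)

# Dividing each count by (total count * bin width) gives the density heights.
manual = counts / (counts.sum() * widths)

# The bars of the normalized histogram should enclose a total area of 1.
total_area = float(np.sum(density * widths))

print(np.allclose(manual, density))
print(round(total_area, 6))
```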

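To draw the kde, a gaussian "bell" curve is drawn around each of the N samples. These curves are summed and normalized by dividing by N, so the total area under the kde is 1. The sigma of these curves (the bandwidth) is a free parameter. By default it is calculated by Scott's rule, which scales N ** (-1/5) by the spread of the sample (its standard deviation in scipy's gaussian_kde); this gives about 0.4 for 100 points from a standard normal distribution (the green curve in the example plot).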
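As a hedged sketch of this description, the averaged bells can be written in vectorized NumPy. The bandwidth line assumes the convention of scipy.stats.gaussian_kde (Scott's factor times the sample standard deviation); the helper names scott_sigma and gauss_kde are made up for this example and are not part of seaborn.

```python
import numpy as np

def scott_sigma(samples):
    """Scott's rule for 1D data: std * N ** (-1/5) (assumed convention)."""
    n = len(samples)
    return samples.std(ddof=1) * n ** (-1 / 5)

def gauss_kde(grid, samples, sigma):
    """Average of N gaussian bells, one centered on each sample."""
    z = (grid[:, None] - samples[None, :]) / sigma
    bells = np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2 * np.pi))
    return bells.mean(axis=1)

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
xs = rng.normal(0, 1, 100)

sigma = scott_sigma(xs)
grid = np.linspace(xs.min() - 5 * sigma, xs.max() + 5 * sigma, 2001)
kde = gauss_kde(grid, xs, sigma)

# Riemann sum: the area under the kde should be very close to 1.
area = float(np.sum(kde) * (grid[1] - grid[0]))
print(f"sigma = {sigma:.3f}, area = {area:.4f}")
```

Each bell has area 1, so averaging N of them keeps the total area at 1, which is why the kde can be compared directly with the density-normalized histogram.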

The code below shows the result for different choices of sigma. Smaller sigmas follow the given data more closely; larger sigmas give a smoother curve. There is no perfect choice for sigma; it depends strongly on the data and on what is known (or guessed) about the underlying distribution.

import matplotlib.pyplot as plt
import numpy as np

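def gauss(x, mu, sigma):
    # Normal probability density with mean mu and standard deviation sigma
    return np.exp(-((x - mu) / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))

N = 100
xs = np.random.normal(0, 1, N)

# Histogram normalized so that its total area is 1
plt.hist(xs, density=True, label='Histogram', alpha=.4, ec='w')
# Evaluate the kde on a grid that extends a bit beyond the data
x = np.linspace(xs.min() - 1, xs.max() + 1, 100)
for sigma in np.arange(.2, 1.2, .2):
    # Sum one gaussian per sample, then divide by N so the area stays 1
    plt.plot(x, sum(gauss(x, xi, sigma) for xi in xs) / N, label=f'$\\sigma = {sigma:.1f}$')
plt.xlim(x[0], x[-1])
plt.legend()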
plt.show()
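On the second question (why the curve runs from about -3 to 4): the kde is evaluated on a grid that extends past the smallest and largest sample, because the bells around the extreme points still have visible mass beyond their centers. The sketch below assumes seaborn's cut parameter with its default of 3, which extends the grid by cut times the bandwidth on each side. The bandwidth is taken as the sample standard deviation times N ** (-1/5), which is an assumption about the exact Scott's-rule convention, and gridsize is only an illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
xs = rng.normal(0, 1, 100)

bw = xs.std(ddof=1) * len(xs) ** (-1 / 5)  # Scott's rule (assumed convention)
cut = 3          # assumed default of seaborn's `cut` parameter
gridsize = 100   # illustrative number of grid points

# The evaluation grid reaches `cut` bandwidths beyond the extreme samples.
lo, hi = xs.min() - cut * bw, xs.max() + cut * bw
grid = np.linspace(lo, hi, gridsize)

print(f"data range: [{xs.min():.2f}, {xs.max():.2f}]")
print(f"kde range:  [{grid[0]:.2f}, {grid[-1]:.2f}]")
```

With a bandwidth of about 0.4, the grid extends roughly 1.2 units past each extreme, which would be consistent with a sample spanning about -2 to 3 being drawn from about -3 to 4.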

PS: Instead of a histogram or a kde, other ways to visualize 100 random numbers are a set of short lines:

plt.plot(np.repeat(xs, 3), np.tile((0, -0.05, np.nan), N), lw=1, c='k', alpha=0.5)
plt.ylim(ymin=-0.05)

or dots (jittered, so they don't overlap):

plt.scatter(xs, -np.random.rand(N)/10, s=1, color='crimson')
plt.ylim(ymin=-0.099)
