Why is my KDE plot showing up as vertical lines instead of a curve?

Question

I have been trying to make a KDE plot for my data (the frequency of chromosome start sites). Although I follow the examples exactly, when I use my own data, or generated data that resembles it, the plot breaks and shows only vertical lines instead of a smooth curve. I was hoping someone more familiar with scikit-learn's KernelDensity could help me figure out what I am doing wrong.

Here is the code with generated data from the example, where everything runs fine:

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

X = np.concatenate((np.random.normal(0, 1, 14), np.random.normal(5, 1, 6)))[:, np.newaxis]
X_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]
kde = KernelDensity(kernel='gaussian', bandwidth=1.0).fit(X) 
log_density = kde.score_samples(X_plot)

fig, ax = plt.subplots()
plt.fill_between(X_plot[:, 0], np.exp(log_density), color="b")
plt.plot(X, np.full_like(X, -0.01), '|k', markeredgewidth=.01)
ax.set_xlim(-5, 10)

Here is the code with data I generated to look like my data. I have 1,000 start sites in the data and they range in value from 10000 to 824989. I changed the data, the linspace range and step, and the x axis, and now I get vertical lines instead of a curve. I also changed the y limits because they turned out really weird.

X = np.random.normal(10000, 824989, 1000)[:, np.newaxis]
X_plot = np.linspace(10000, 824989, 100000)[:, np.newaxis]
kde = KernelDensity(kernel='gaussian', bandwidth=1.0).fit(X) 
log_density = kde.score_samples(X_plot)

fig, ax = plt.subplots()
plt.fill_between(X_plot[:, 0], np.exp(log_density), color="b")
plt.plot(X, np.full_like(X, -0.01), '|k', markeredgewidth=.01)
ax.set_xlim(10000, 824989)
ax.set_ylim(-0.0001, 0.00061) 

I think it must have something to do with the linspace. I don't really understand why score_samples() takes the linspace as a parameter either.

Recommended answer

Your code has two problems:

  1. The bandwidth used in the kernel density estimation needs to be much larger, because your data has a far larger standard deviation than the example data (824,989 for your data versus about 2.5 for the example). You would need a bandwidth of approximately 200,000 instead of 1. See, for instance, the section on "A rule-of-thumb bandwidth estimator" in the Wikipedia article on Kernel density estimation.
  2. The purpose of np.linspace() is to generate the set of points at which the estimated kernel density function kde is evaluated. To visualize the full distribution of your data, the first argument of np.linspace() should be the minimum of the data (not the mean), and the second argument should be the maximum of the data (not the standard deviation).
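To illustrate the first point, here is a minimal sketch that compares a bandwidth of 1 with the rule-of-thumb bandwidth on data generated like the question's. The fixed seed, the 2,000-point grid, the `1e-12` threshold, and the helper name `density_on_grid` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Fixed seed is an assumption, used only to make the sketch reproducible
rng = np.random.default_rng(0)
X = rng.normal(10000, 824989, 1000)[:, np.newaxis]

# Rule-of-thumb bandwidth: 1.06 * std * n^(-1/5)
h = 1.06 * np.std(X) * len(X) ** (-1 / 5)

# Evaluation grid spanning the observed data range
grid = np.linspace(X.min(), X.max(), 2000)[:, np.newaxis]

def density_on_grid(bandwidth):
    """Fit a Gaussian KDE with the given bandwidth and evaluate it on the grid."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(X)
    return np.exp(kde.score_samples(grid))

narrow = density_on_grid(1.0)  # bandwidth from the question
wide = density_on_grid(h)      # rule-of-thumb bandwidth

# Share of grid points where the estimated density is effectively zero
print("bandwidth=1:", np.mean(narrow < 1e-12))
print(f"bandwidth={h:.0f}:", np.mean(wide < 1e-12))
```

With a bandwidth of 1, each kernel is only a few units wide while the data points are thousands of units apart, so the estimate is essentially zero almost everywhere and nonzero only right next to a data point. Those isolated spikes are what appear as vertical lines in the plot.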

I included an example below.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity

mu = 10000 # mean
sigma = 824989 # standard deviation

# generate the data
X = np.random.normal(mu, sigma, 1000)[:, np.newaxis]

# estimate the optimal bandwidth
h = 1.06 * np.std(X) * (len(X) ** (- 1 / 5))

# estimate the density function
kde = KernelDensity(kernel='gaussian', bandwidth=h).fit(X)

# evaluate the density function
x = np.linspace(np.min(X), np.max(X), 100000)[:, np.newaxis]
log_density = kde.score_samples(x)
density = np.exp(log_density)

# plot the density function
plt.plot(x, density)
plt.show()
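As a rough sanity check on the chosen bandwidth and evaluation grid, the estimated density should integrate to approximately 1. The sketch below assumes the same kind of generated data; the seed, the padding of four bandwidths on each side, and the 5,000-point grid are illustrative choices, not part of the original answer.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Fixed seed is an assumption, used only to make the sketch reproducible
rng = np.random.default_rng(0)
X = rng.normal(10000, 824989, 1000)[:, np.newaxis]

# Rule-of-thumb bandwidth, as in the example above
h = 1.06 * np.std(X) * len(X) ** (-1 / 5)
kde = KernelDensity(kernel="gaussian", bandwidth=h).fit(X)

# Pad the grid beyond the data range so the kernel tails are included
pad = 4 * h
x = np.linspace(X.min() - pad, X.max() + pad, 5000)
density = np.exp(kde.score_samples(x[:, np.newaxis]))

# Trapezoidal rule written out explicitly
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(x))
print(f"area under the estimated density: {area:.4f}")
```

If the area is far from 1, the grid is probably too coarse for the bandwidth or does not cover the data, which are the same conditions that make the plot look wrong.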
