Plotting a histogram with overlaid PDF

Question

This is a follow-up to my previous couple of questions. Here's the code I'm playing with:

import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np
dictOne = {'Name':['First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth', 'Seventh', 'Eighth', 'Ninth'],
           "A":[1, 2, -3, 4, 5, np.nan, 7, np.nan, 9],
           "B":[4, 5, 6, 5, 3, np.nan, 2, 9, 5],
           "C":[7, np.nan, 10, 5, 8, 6, 8, 2, 4]}
df2 = pd.DataFrame(dictOne)
column = 'B'
df2[df2[column] > -999].hist(column, alpha = 0.5)
param = stats.norm.fit(df2[column].dropna())   # Fit a normal distribution to the data
print(param)
pdf_fitted = stats.norm.pdf(df2[column], *param)
plt.plot(pdf_fitted, color = 'r')

I'm trying to make a histogram of the numbers in a single column of the dataframe (I can do this), but with an overlaid normal curve, something like the last graph on here. I'm trying to get it working on this toy example so that I can apply it to my much larger dataset for real. The code I've pasted above gives me this graph:

Why doesn't pdf_fitted match the data in this graph? How can I overlay the proper PDF?

Recommended answer

You should plot the histogram with density=True if you want to compare it to a true PDF. Otherwise the bars show raw counts while the PDF integrates to 1, so the normalization (amplitude) will be off.

Also, you need to pass explicit x-values (as an ordered array) when you plot the PDF; with a single argument, plt.plot draws the values against their index positions rather than against the data values:

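fig, ax = plt.subplots()

# density=True rescales the bars so their total area is 1, like a PDF
df2[df2[column] > -999].hist(column, alpha=0.5, density=True, ax=ax)

param = stats.norm.fit(df2[column].dropna())   # (mean, standard deviation)
# Evenly spaced, ordered x-values spanning the observed range (NaN is skipped)
x = np.linspace(df2[column].min(), df2[column].max(), 100)

ax.plot(x, stats.norm.pdf(x, *param), color='r')
plt.show()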
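To see why the normalization matters, here is a minimal sketch that skips the plotting and checks the numbers directly. It assumes the non-null values of column B from the question and 10 bins (the pandas hist default); the variable names are illustrative only. It also shows the alternative of keeping raw counts and scaling the PDF by the sample size times the bin width.

```python
import numpy as np
import scipy.stats as stats

# Column "B" from the question with the NaN dropped (assumed sample)
data = np.array([4, 5, 6, 5, 3, 2, 9, 5], dtype=float)

# density=True rescales bar heights so that the total bar area is 1
heights, edges = np.histogram(data, bins=10, density=True)
widths = np.diff(edges)
density_area = float(np.sum(heights * widths))

# Alternative: keep raw counts and scale the fitted PDF up to match them
counts, _ = np.histogram(data, bins=10)
mu, sigma = stats.norm.fit(data)
centers = (edges[:-1] + edges[1:]) / 2
scaled_pdf = stats.norm.pdf(centers, mu, sigma) * len(data) * widths

print("area under density histogram:", density_area)
print("total count:", int(counts.sum()))
print("PDF scaled to counts:", np.round(scaled_pdf, 3))
```

Either approach puts the bars and the curve on the same scale; density=True is usually simpler because the fitted PDF can then be plotted unchanged.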

As an aside, a histogram isn't always the best way to compare a continuous variable with a distribution. (Your sample data are discrete, but the link uses a continuous variable.) The choice of bins can alias the shape of your histogram, which may lead to incorrect inference. The ECDF is a much better (choice-free) illustration of the distribution of a continuous variable:

def ECDF(data):
    # Sorted non-null values and the fraction of points at or below each one
    x = np.sort(data.dropna())
    n = len(x)
    y = np.arange(1, n + 1) / n
    return x, y

fig, ax = plt.subplots()

ax.plot(*ECDF(df2.loc[df2[column] > -999, column]), marker='o')

param = stats.norm.fit(df2[column].dropna())
x = np.linspace(df2[column].min(), df2[column].max(), 100)  # x-values

ax.plot(x, stats.norm.cdf(x, *param), color='r')
plt.show()
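The point about bin choice can be checked without plotting. The sketch below, again assuming the non-null values of column B and an illustrative helper named ecdf, bins the same data three different ways and then computes the ECDF, which involves no binning parameter at all.

```python
import numpy as np

# Column "B" from the question with the NaN dropped (assumed sample)
data = np.array([4, 5, 6, 5, 3, 2, 9, 5], dtype=float)

def ecdf(values):
    """Return the sorted values and the fraction of points <= each value."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Same data, different bin counts: the pattern of bar heights changes
for bins in (3, 5, 10):
    counts, _ = np.histogram(data, bins=bins)
    print(bins, "bins ->", counts.tolist())

# The ECDF depends only on the data, not on any binning choice
x, y = ecdf(data)
print("x:", x.tolist())
print("y:", y.tolist())
```

The histogram output differs for each bin count, while the ECDF is fully determined by the sample.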
