通过重复输入来绘制置信度和预测间隔 [英] Plotting confidence and prediction intervals with repeated entries

查看:102
本文介绍了通过重复输入来绘制置信度和预测间隔的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个变量的相关图,x轴上的预测变量(温度)和y轴上的响应变量(密度).我最适合的最小二乘回归线是二阶多项式.我还要绘制置信度和预测间隔. 答案中描述的方法似乎很完美.但是,我的数据集(n = 2340)对许多(x,y)对都有重复的条目.我得到的情节看起来像这样:

这是我的相关代码(从上面的链接答案中略作修改):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import statsmodels.formula.api as smf    
from statsmodels.stats.outliers_influence import summary_table

d = {'temp': x, 'dens': y}
df = pd.DataFrame(data=d)

x = df.temp
y = df.dens

plt.figure(figsize=(6 * 1.618, 6))
plt.scatter(x,y, s=10, alpha=0.3)
plt.xlabel('temp')
plt.ylabel('density')

# points linearly spaced for predictor variable
x1 = pd.DataFrame({'temp': np.linspace(df.temp.min(), df.temp.max(), 100)})

# 2nd order polynomial
poly_2 = smf.ols(formula='dens ~ 1 + temp + I(temp ** 2.0)',   data=df).fit()

# this correctly plots my single 2nd-order poly best-fit line:
plt.plot(x1.temp, poly_2.predict(x1), 'g-', label='Poly n=2  $R^2$=%.2f' % poly_2.rsquared, 
     alpha=0.9)

prstd, iv_l, iv_u = wls_prediction_std(poly_2)

st, data, ss2 = summary_table(poly_2, alpha=0.05)

fittedvalues = data[:,2]
predict_mean_se  = data[:,3]
predict_mean_ci_low, predict_mean_ci_upp = data[:,4:6].T
predict_ci_low, predict_ci_upp = data[:,6:8].T

# check we got the right things
print np.max(np.abs(poly_2.fittedvalues - fittedvalues))
print np.max(np.abs(iv_l - predict_ci_low))
print np.max(np.abs(iv_u - predict_ci_upp))

plt.plot(x, y, 'o')
plt.plot(x, fittedvalues, '-', lw=2)
plt.plot(x, predict_ci_low, 'r--', lw=2)
plt.plot(x, predict_ci_upp, 'r--', lw=2)
plt.plot(x, predict_mean_ci_low, 'r--', lw=2)
plt.plot(x, predict_mean_ci_upp, 'r--', lw=2)

按预期,打印语句的值为0.0. 但是,我需要条线作为多项式最佳拟合线,以及置信度和预测间隔(而不是我当前在绘图中拥有的多条线).有什么想法吗?

更新: @kpie 的第一个答案之后,我根据温度对置信度和预测间隔数组进行了排序:

data_intervals = {'temp': x, 'predict_low': predict_ci_low, 'predict_upp': predict_ci_upp, 'conf_low': predict_mean_ci_low, 'conf_high': predict_mean_ci_upp}

df_intervals = pd.DataFrame(data=data_intervals)

df_intervals_sort = df_intervals.sort(columns='temp')

这取得了预期的结果:

解决方案

您需要根据温度对预测值进行排序.我认为*

因此,要获得优美的曲线,您必须使用 numpy.polynomial.polynomial.polyfit 这将返回系数列表.您将不得不将x和y数据分成2个列表,以使其适合该函数.

然后可以使用以下命令绘制此函数:

def strPolynomialFromArray(coeffs):
    return("".join([str(k)+"*x**"+str(n)+"+" for n,k in enumerate(coeffs)])[0:-1])

from numpy import *
from matplotlib.pyplot import *
x = linespace(-15,45,300) # your smooth line will be made of 300 smooth pieces
y = exec(strPolynomialFromArray(numpy.polynomial.polynomial.polyfit(xs,ys,degree)))
plt.plot(x , y)

您可以进一步研究绘制平滑线此处,只要记住所有线都是线性样条线,是因为连续曲率是不合理的.

我认为多项式拟合是通过最小二乘拟合完成的(过程在此描述)

祝你好运!

I have a correlation plot for two variables, the predictor variable (temperature) on the x-axis, and the response variable (density) on the y-axis. My best fit least squares regression line is a 2nd order polynomial. I would like to also plot confidence and prediction intervals. The method described in this answer seems perfect. However, my dataset (n=2340) has repeated entries for many (x,y) pairs. My resulting plot looks like this:

Here is my relevant code (slightly modified from linked answer above):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.sandbox.regression.predstd import wls_prediction_std
import statsmodels.formula.api as smf    
from statsmodels.stats.outliers_influence import summary_table

d = {'temp': x, 'dens': y}
df = pd.DataFrame(data=d)

x = df.temp
y = df.dens

plt.figure(figsize=(6 * 1.618, 6))
plt.scatter(x,y, s=10, alpha=0.3)
plt.xlabel('temp')
plt.ylabel('density')

# points linearly spaced for predictor variable
x1 = pd.DataFrame({'temp': np.linspace(df.temp.min(), df.temp.max(), 100)})

# 2nd order polynomial
poly_2 = smf.ols(formula='dens ~ 1 + temp + I(temp ** 2.0)',   data=df).fit()

# this correctly plots my single 2nd-order poly best-fit line:
plt.plot(x1.temp, poly_2.predict(x1), 'g-', label='Poly n=2  $R^2$=%.2f' % poly_2.rsquared, 
     alpha=0.9)

prstd, iv_l, iv_u = wls_prediction_std(poly_2)

st, data, ss2 = summary_table(poly_2, alpha=0.05)

fittedvalues = data[:,2]
predict_mean_se  = data[:,3]
predict_mean_ci_low, predict_mean_ci_upp = data[:,4:6].T
predict_ci_low, predict_ci_upp = data[:,6:8].T

# check we got the right things
print np.max(np.abs(poly_2.fittedvalues - fittedvalues))
print np.max(np.abs(iv_l - predict_ci_low))
print np.max(np.abs(iv_u - predict_ci_upp))

plt.plot(x, y, 'o')
plt.plot(x, fittedvalues, '-', lw=2)
plt.plot(x, predict_ci_low, 'r--', lw=2)
plt.plot(x, predict_ci_upp, 'r--', lw=2)
plt.plot(x, predict_mean_ci_low, 'r--', lw=2)
plt.plot(x, predict_mean_ci_upp, 'r--', lw=2)

The print statements evaluate to 0.0, as expected. However, I need single lines for the polynomial best fit line, and the confidence and prediction intervals (rather than the multiple lines I currently have in my plot). Any ideas?

Update: Following first answer from @kpie, I ordered my confidence and prediction interval arrays according to temperature:

data_intervals = {'temp': x, 'predict_low': predict_ci_low, 'predict_upp': predict_ci_upp, 'conf_low': predict_mean_ci_low, 'conf_high': predict_mean_ci_upp}

df_intervals = pd.DataFrame(data=data_intervals)

df_intervals_sort = df_intervals.sort(columns='temp')

This achieved desired results:

解决方案

You need to order your predict values based on temperature. I think*

So to get nice curvy lines you will have to use numpy.polynomial.polynomial.polyfit This will return a list of coefficients. You will have to split the x and y data into 2 lists so it fits in the function.

You can then plot this function with:

def strPolynomialFromArray(coeffs):
    return("".join([str(k)+"*x**"+str(n)+"+" for n,k in enumerate(coeffs)])[0:-1])

from numpy import *
from matplotlib.pyplot import *
x = linespace(-15,45,300) # your smooth line will be made of 300 smooth pieces
y = exec(strPolynomialFromArray(numpy.polynomial.polynomial.polyfit(xs,ys,degree)))
plt.plot(x , y)

You can look more into plotting smooth lines here just remember all lines are linear splines, becasue continuous curvature is irrational.

I believe that the polynomial fitting is done with least squares fitting (process described here)

Good Luck!

这篇关于通过重复输入来绘制置信度和预测间隔的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆