Plot best fit line with plotly

Question

I am using plotly's Python library to plot a scatter graph of time series data. Example data:

2015-11-11    1
2015-11-12    2
2015-11-14    4
2015-11-15    2
2015-11-21    3
2015-11-22    2
2015-11-23    3

Code in python:

df = pandas.read_csv('~/Data.csv', parse_dates=["date"], header=0)
df = df.sort_values(by=['date'], ascending=[True])
trace = go.Scatter(
            x=df['date'],
            y=df['score'],
            mode='markers'
)
fig.append_trace(trace, 2, 2)  # It is a subplot
iplot(fig)

Once the scatter plot is plotted, I want to plot a best fit line over this.

Does plotly provide this programmatically? It does from the webapp, but I did not find any documentation about how to do it programmatically. The line in the link is exactly what I want:

Answer

The code snippet you provided is missing a fig definition. I prefer using plotly.graph_objs, but with the setup below you can choose to show your figures using either fig.show() or iplot(fig). You won't be able to just pass an argument and get a best fit line automatically, but you can certainly get one programmatically. You'll just need to add a couple of lines to your original setup and you're good to go.

Plot: [image: the score markers with the best fit line drawn over them]

Complete code with sample data:

import pandas as pd
import datetime
import statsmodels.api as sm
import plotly.graph_objs as go
from plotly.offline import iplot

# sample data
df=pd.DataFrame({'date': {0: '2015-11-11',
                      1: '2015-11-12',
                      2: '2015-11-14',
                      3: '2015-11-15',
                      4: '2015-11-21',
                      5: '2015-11-22',
                      6: '2015-11-23'},
                     'score': {0: 1, 1: 2, 2: 4, 3: 2, 4: 3, 5: 2, 6: 3}})

df = df.sort_values(by=['date'], ascending=[True])

# data for time series linear regression
df['timestamp']=pd.to_datetime(df['date'])
df['serialtime']=[(d-datetime.datetime(1970,1,1)).days for d in df['timestamp']]

x = sm.add_constant(df['serialtime'])
model = sm.OLS(df['score'], x).fit()
df['bestfit']=model.fittedvalues

# plotly setup
fig=go.Figure()

# source data
fig.add_trace(go.Scatter(x=df['date'],
                         y=df['score'],
                         mode='markers',
                         name = 'score')
             )

# regression data
fig.add_trace(go.Scatter(x=df['date'],
                         y=df['bestfit'],
                         mode='lines',
                         name='best fit',
                         line=dict(color='firebrick', width=2)
                        ))

iplot(fig)

Some details:

Time series often present certain issues for linear OLS estimation. The format of the dates themselves can be challenging, so in this case it would be tempting to use the index of your dataframe as the independent variable. But since your dates have gaps (they are not consecutive days), simply replacing them with a continuous series such as 0, 1, 2, ... would result in erroneous regression coefficients. I often find it best to use a serialized integer array to represent time series data, meaning that each date is represented by an integer, which in turn is the count of days since some epoch. In this case the epoch is 1970-01-01.

And that's exactly what I'm doing here:

df['timestamp'] = pd.to_datetime(df['date'])
df['serialtime'] = [(d - datetime.datetime(1970,1,1)).days for d in df['timestamp']]
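
The same conversion can also be written without a Python-level loop. The following is a minimal sketch, assuming the same sample data as above; it subtracts the epoch from the whole column and reads the day count through the `.dt.days` accessor. The variable name `serial_days` is only there for illustration.

```python
import pandas as pd

# Same sample data as in the complete example above.
df = pd.DataFrame({
    "date": ["2015-11-11", "2015-11-12", "2015-11-14", "2015-11-15",
             "2015-11-21", "2015-11-22", "2015-11-23"],
    "score": [1, 2, 4, 2, 3, 2, 3],
})

df["timestamp"] = pd.to_datetime(df["date"])

# Whole days elapsed since the epoch, computed on the entire column at once.
df["serialtime"] = (df["timestamp"] - pd.Timestamp("1970-01-01")).dt.days

serial_days = df["serialtime"].tolist()
print(serial_days)
```

Note that the gaps between the dates (for example between 2015-11-15 and 2015-11-21) are preserved in the integer values, which is the whole point of the conversion.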

Here's a plot that illustrates how using the wrong data affects your OLS estimates:
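
The effect can also be checked numerically. The sketch below is not part of the original answer: it assumes only numpy, and `ols_slope` is a hypothetical helper that computes the slope of a simple least-squares line by hand. It fits the sample scores once against the serialized day counts and once against the plain row index.

```python
import numpy as np

def ols_slope(x, y):
    """Slope b of the least-squares line y = a + b*x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    return float((x_centered * (y - y.mean())).sum() / (x_centered ** 2).sum())

dates = np.array(["2015-11-11", "2015-11-12", "2015-11-14", "2015-11-15",
                  "2015-11-21", "2015-11-22", "2015-11-23"],
                 dtype="datetime64[D]")
score = np.array([1, 2, 4, 2, 3, 2, 3])

# Days since 1970-01-01: keeps the gaps between observations.
serial_days = (dates - np.datetime64("1970-01-01")).astype("int64")
# Plain row index 0..6: pretends the observations are evenly spaced.
row_index = np.arange(len(dates))

slope_per_day = ols_slope(serial_days, score)
slope_per_row = ols_slope(row_index, score)

print(f"slope using serialized days: {slope_per_day:.4f}")
print(f"slope using row index:       {slope_per_row:.4f}")
```

For this sample the row-index fit gives a slope of roughly 0.18 per row, while the fit on actual days gives roughly 0.07 per day, so treating the rows as evenly spaced noticeably overstates the trend.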
