Python pandas linear regression groupby
Problem description
I am trying to run a linear regression on each group of a grouped pandas DataFrame:
This is the dataframe df:
group  date        value
A      01-02-2016  16
A      01-03-2016  15
A      01-04-2016  14
A      01-05-2016  17
A      01-06-2016  19
A      01-07-2016  20
B      01-02-2016  16
B      01-03-2016  13
B      01-04-2016  13
C      01-02-2016  16
C      01-03-2016  16

# import standard packages
import pandas as pd
import numpy as np

# import ML packages
from sklearn.linear_model import LinearRegression

# First, let's group the data by group
df_group = df.groupby('group')

# Then, we need to change the date to integer
df['date'] = pd.to_datetime(df['date'])
df['date_delta'] = (df['date'] - df['date'].min()) / np.timedelta64(1,'D')
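The question shows df only as a table, so here is a minimal sketch of how it could be built and how the date_delta column comes out. The explicit format="%m-%d-%Y" is an assumption of mine: it reads 01-02-2016 as 2 January 2016, which matches the results reported further down.

```python
import numpy as np
import pandas as pd

# Rows copied from the table in the question; dates are month-day-year strings.
rows = [
    ("A", "01-02-2016", 16), ("A", "01-03-2016", 15), ("A", "01-04-2016", 14),
    ("A", "01-05-2016", 17), ("A", "01-06-2016", 19), ("A", "01-07-2016", 20),
    ("B", "01-02-2016", 16), ("B", "01-03-2016", 13), ("B", "01-04-2016", 13),
    ("C", "01-02-2016", 16), ("C", "01-03-2016", 16),
]
df = pd.DataFrame(rows, columns=["group", "date", "value"])

# Parse with an explicit format so the month/day order is not guessed.
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")

# Days elapsed since the earliest date in the whole frame, as floats.
df["date_delta"] = (df["date"] - df["date"].min()) / np.timedelta64(1, "D")

print(df)
```

Because the minimum is taken over the whole frame rather than per group, every group shares the same day 0 (01-02-2016).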
Now I want to predict the value for each group for 01-10-2016.
I want to get to a new dataframe like this:
group  01-10-2016
A      predicted value
B      predicted value
C      predicted value
The approach from "How to apply OLS from statsmodels to groupby" doesn't work:
for group in df_group.groups.keys():
    df = df_group.get_group(group)
    X = df['date_delta']
    y = df['value']
    model = LinearRegression(y, X)
    results = model.fit(X, y)
    print results.summary()
I get the following error
ValueError: Found arrays with inconsistent numbers of samples: [ 1 52]
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
UPDATE:
I changed it to
for group in df_group.groups.keys():
    df = df_group.get_group(group)
    X = df[['date_delta']]
    y = df.value
    model = LinearRegression(y, X)
    results = model.fit(X, y)
    print results.summary()
and now I get this error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Solution
New Answer
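def model(df, delta):
    # Fit value ~ date_delta for one group and predict at delta days
    y = df[['value']].values
    X = df[['date_delta']].values
    # predict expects a 2D array of shape (n_samples, n_features)
    return np.squeeze(LinearRegression().fit(X, y).predict(np.array([[delta]])))

def group_predictions(df, date):
    date = pd.to_datetime(date)
    df.date = pd.to_datetime(df.date)

    day = np.timedelta64(1, 'D')
    mn = df.date.min()
    df['date_delta'] = df.date.sub(mn).div(day)

    # Days between the target date and the earliest date in the data
    dd = (date - mn) / day

    return df.groupby('group').apply(model, delta=dd)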
Demo
group_predictions(df, '01-10-2016')

group
A    22.333333333333332
B     3.500000000000007
C                  16.0
dtype: object
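The question asked for a DataFrame with a group column and a 01-10-2016 column rather than a Series. A rough sketch of one way to get that shape follows. It swaps in numpy.polyfit for the per-group straight-line fit instead of LinearRegression, and the function name predict_by_group is hypothetical, not part of the answer above.

```python
import numpy as np
import pandas as pd

def predict_by_group(df, date, fmt="%m-%d-%Y"):
    """Fit value ~ days per group with np.polyfit and predict at `date`."""
    dates = pd.to_datetime(df["date"], format=fmt)
    origin = dates.min()
    day = np.timedelta64(1, "D")
    # Work on a copy so the caller's frame is left untouched.
    work = df.assign(days=(dates - origin) / day)
    target_days = (pd.to_datetime(date, format=fmt) - origin) / day

    records = []
    for name, g in work.groupby("group"):
        # deg=1 returns the coefficients as [slope, intercept].
        slope, intercept = np.polyfit(g["days"].to_numpy(), g["value"].to_numpy(), deg=1)
        records.append({"group": name, date: slope * target_days + intercept})
    return pd.DataFrame(records)

# Same rows as the table in the question.
df = pd.DataFrame({
    "group": list("AAAAAABBBCC"),
    "date": ["01-02-2016", "01-03-2016", "01-04-2016", "01-05-2016",
             "01-06-2016", "01-07-2016", "01-02-2016", "01-03-2016",
             "01-04-2016", "01-02-2016", "01-03-2016"],
    "value": [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
})

result = predict_by_group(df, "01-10-2016")
print(result)
```

For a single feature, an ordinary least-squares line is the same fit either way, so the numbers should agree with the demo above up to floating-point noise.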
Old Answer
You're using LinearRegression wrong.
- You don't call it with the data and then fit with the data. Just call the class like this:
  model = LinearRegression()
- Then fit with model.fit(X, y)
But all that does is set values in the object stored in model. There is no nice summary method. There probably is one somewhere, but I know the one in statsmodels, so see below.
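To make the two steps concrete, here is a small sketch on group A from the question. The array literals are copied from the table, and 8.0 is the number of days from 01-02-2016 to 01-10-2016. After fitting, the results are read from the coef_ and intercept_ attributes, since there is no summary method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Group A: days since 01-02-2016 and the observed values.
X = np.array([0, 1, 2, 3, 4, 5], dtype=float).reshape(-1, 1)  # shape (6, 1)
y = np.array([16, 15, 14, 17, 19, 20], dtype=float)

model = LinearRegression()  # the constructor takes options only, no data
model.fit(X, y)             # fitting stores the coefficients on the model

slope = model.coef_[0]
intercept = model.intercept_
# predict also expects a 2D array: one row per sample, one column per feature.
prediction = model.predict(np.array([[8.0]]))[0]

print(slope, intercept, prediction)
```

Passing X as a 2D array with one column is what avoids the "Passing 1d arrays as data" warning quoted in the question.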
Option 1
Use statsmodels instead
from statsmodels.formula.api import ols

for k, g in df_group:
    model = ols('value ~ date_delta', g)
    results = model.fit()
    print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  value   R-squared:                       0.652
Model:                            OLS   Adj. R-squared:                  0.565
Method:                 Least Squares   F-statistic:                     7.500
Date:                Fri, 06 Jan 2017   Prob (F-statistic):             0.0520
Time:                        10:48:17   Log-Likelihood:                -9.8391
No. Observations:                   6   AIC:                             23.68
Df Residuals:                       4   BIC:                             23.26
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.3333      1.106     12.965      0.000        11.264    17.403
date_delta     1.0000      0.365      2.739      0.052        -0.014     2.014
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.393
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.461
Skew:                          -0.649   Prob(JB):                        0.794
Kurtosis:                       2.602   Cond. No.                         5.78
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results
==============================================================================
Dep. Variable:                  value   R-squared:                       0.750
Model:                            OLS   Adj. R-squared:                  0.500
Method:                 Least Squares   F-statistic:                     3.000
Date:                Fri, 06 Jan 2017   Prob (F-statistic):              0.333
Time:                        10:48:17   Log-Likelihood:                -3.2171
No. Observations:                   3   AIC:                             10.43
Df Residuals:                       1   BIC:                             8.631
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     15.5000      1.118     13.864      0.046         1.294    29.706
date_delta    -1.5000      0.866     -1.732      0.333       -12.504     9.504
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   3.000
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.531
Skew:                          -0.707   Prob(JB):                        0.767
Kurtosis:                       1.500   Cond. No.                         2.92
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results
==============================================================================
Dep. Variable:                  value   R-squared:                        -inf
Model:                            OLS   Adj. R-squared:                   -inf
Method:                 Least Squares   F-statistic:                    -0.000
Date:                Fri, 06 Jan 2017   Prob (F-statistic):                nan
Time:                        10:48:17   Log-Likelihood:                 63.481
No. Observations:                   2   AIC:                            -123.0
Df Residuals:                       0   BIC:                            -125.6
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     16.0000        inf          0        nan           nan       nan
date_delta -3.553e-15        inf         -0        nan           nan       nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.400
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.333
Skew:                           0.000   Prob(JB):                        0.846
Kurtosis:                       1.000   Cond. No.                         2.62
==============================================================================
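The summaries describe each fit but do not give the value for 01-10-2016 that the question asked about. As a hedged sketch, the fitted results object can also be asked for a prediction by passing a small DataFrame with the same column name used in the formula. The new_point frame and the predictions dict are my own illustrative names, and 8.0 assumes day 0 is 01-02-2016.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Same data as the question, with date_delta already computed
# (days since 01-02-2016, the earliest date in the frame).
df = pd.DataFrame({
    "group": list("AAAAAABBBCC"),
    "value": [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
    "date_delta": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 1.0, 2.0, 0.0, 1.0],
})

# 01-10-2016 is 8 days after 01-02-2016.
new_point = pd.DataFrame({"date_delta": [8.0]})

predictions = {}
for name, g in df.groupby("group"):
    results = ols("value ~ date_delta", data=g).fit()
    # predict applies the formula to the new frame and returns a Series.
    predictions[name] = float(results.predict(new_point).iloc[0])

print(pd.Series(predictions, name="01-10-2016"))
```

Group C has only two observations, so its line passes through both points exactly; that is why its summary above shows inf standard errors and nan p-values even though a prediction can still be computed.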