Python pandas linear regression groupby

I am trying to run a linear regression on each group of a grouped pandas DataFrame:

This is the dataframe df:

  group      date      value
    A     01-02-2016     16 
    A     01-03-2016     15 
    A     01-04-2016     14 
    A     01-05-2016     17 
    A     01-06-2016     19 
    A     01-07-2016     20 
    B     01-02-2016     16 
    B     01-03-2016     13 
    B     01-04-2016     13 
    C     01-02-2016     16 
    C     01-03-2016     16 

#import standard packages
import pandas as pd
import numpy as np

#import ML packages
from sklearn.linear_model import LinearRegression

#First, let's group the data by group
df_group = df.groupby('group')

#Then, we need to change the date to integer
df['date'] = pd.to_datetime(df['date'])  
df['date_delta'] = (df['date'] - df['date'].min())  / np.timedelta64(1,'D')

Now I want to predict the value for each group for 01-10-2016.

I want to get to a new dataframe like this:

group      01-10-2016
  A      predicted value
  B      predicted value
  C      predicted value

The approach from "How to apply OLS from statsmodels to groupby" doesn't work:

for group in df_group.groups.keys():
      df= df_group.get_group(group)
      X = df['date_delta'] 
      y = df['value']
      model = LinearRegression(y, X)
      results = model.fit(X, y)
      print results.summary()

I get the following error

ValueError: Found arrays with inconsistent numbers of samples: [ 1 52]

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and   willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.DeprecationWarning)

UPDATE:

I changed it to

for group in df_group.groups.keys():
      df= df_group.get_group(group)
      X = df[['date_delta']]
      y = df.value
      model = LinearRegression(y, X)
      results = model.fit(X, y)
      print results.summary()

and now I get this error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Solution

New Answer

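def model(df, delta):
    # Fit value ~ date_delta for one group, then predict at the day offset `delta`
    y = df[['value']].values
    X = df[['date_delta']].values
    # predict() expects a 2D input of shape (n_samples, n_features)
    return np.squeeze(LinearRegression().fit(X, y).predict([[delta]]))

def group_predictions(df, date):
    date = pd.to_datetime(date)
    df['date'] = pd.to_datetime(df['date'])

    # Express every date as a number of days since the earliest date
    day = np.timedelta64(1, 'D')
    mn = df['date'].min()
    df['date_delta'] = df['date'].sub(mn).div(day)

    # Day offset of the date we want to predict for
    dd = (date - mn) / day

    return df.groupby('group').apply(model, delta=dd)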

demo

group_predictions(df, '01-10-2016')

group
A    22.333333333333332
B     3.500000000000007
C                  16.0
dtype: object
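
The question asked for a DataFrame with one row per group and a column named after the target date. A minimal sketch of that shape follows. It swaps in `np.polyfit` for `LinearRegression` (a different but equivalent way to fit a straight line), rebuilds the sample data from the question, and uses the hypothetical names `predict_at`, `target_delta` and `predictions`, which are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (dates are month-day-year)
df = pd.DataFrame({
    "group": list("AAAAAABBBCC"),
    "date": ["01-02-2016", "01-03-2016", "01-04-2016", "01-05-2016",
             "01-06-2016", "01-07-2016", "01-02-2016", "01-03-2016",
             "01-04-2016", "01-02-2016", "01-03-2016"],
    "value": [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
})
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")

# Days elapsed since the earliest date in the whole frame
origin = df["date"].min()
df["date_delta"] = (df["date"] - origin) / pd.Timedelta(days=1)

# Day offset of the date to predict
target_delta = (pd.Timestamp("2016-01-10") - origin) / pd.Timedelta(days=1)

def predict_at(g, x_new):
    # Degree-1 polynomial fit is an ordinary least-squares line
    slope, intercept = np.polyfit(g["date_delta"], g["value"], deg=1)
    return slope * x_new + intercept

# One row per group, one column named after the target date
predictions = pd.DataFrame(
    [{"group": name, "01-10-2016": predict_at(g, target_delta)}
     for name, g in df.groupby("group")]
)
print(predictions)
```

Iterating over the groups explicitly, rather than calling `groupby(...).apply`, keeps the result a plain float column instead of an object-dtype Series.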

Old Answer

You're using LinearRegression incorrectly.

  • You don't pass the data when you create it and then again when you fit. Just instantiate the class like this
    • model = LinearRegression()
  • then fit with
    • model.fit(X, y)

But all that does is set values on the object stored in model. There is no nice summary method. There probably is one somewhere, but I know the one in statsmodels, so see below.
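
As a small illustration of the point above, here is a sketch that fits group B's data (day offsets 0, 1, 2 and values 16, 13, 13, taken from the question) and reads the fitted values back from the model object. The variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Group B from the question: days since the first date, and the values
X = np.array([0.0, 1.0, 2.0]).reshape(-1, 1)  # 2D: (n_samples, n_features)
y = np.array([16.0, 13.0, 13.0])

# Instantiate without data, then fit with the data
model = LinearRegression()
model.fit(X, y)

# fit() stores its results as attributes on the model
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)

# predict() also needs a 2D input; 8 days after the first date is 01-10-2016
pred = model.predict(np.array([[8.0]]))[0]
print(pred)
```

The `reshape(-1, 1)` call is what the DeprecationWarning in the question asks for: a single feature must be passed as a column, not as a flat 1D array.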


Option 1
Use statsmodels instead

from statsmodels.formula.api import ols

for k, g in df_group:
    model = ols('value ~ date_delta', g)
    results = model.fit()
    print(results.summary())


                        OLS Regression Results                            
==============================================================================
Dep. Variable:                  value   R-squared:                       0.652
Model:                            OLS   Adj. R-squared:                  0.565
Method:                 Least Squares   F-statistic:                     7.500
Date:                Fri, 06 Jan 2017   Prob (F-statistic):             0.0520
Time:                        10:48:17   Log-Likelihood:                -9.8391
No. Observations:                   6   AIC:                             23.68
Df Residuals:                       4   BIC:                             23.26
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.3333      1.106     12.965      0.000        11.264    17.403
date_delta     1.0000      0.365      2.739      0.052        -0.014     2.014
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.393
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.461
Skew:                          -0.649   Prob(JB):                        0.794
Kurtosis:                       2.602   Cond. No.                         5.78
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  value   R-squared:                       0.750
Model:                            OLS   Adj. R-squared:                  0.500
Method:                 Least Squares   F-statistic:                     3.000
Date:                Fri, 06 Jan 2017   Prob (F-statistic):              0.333
Time:                        10:48:17   Log-Likelihood:                -3.2171
No. Observations:                   3   AIC:                             10.43
Df Residuals:                       1   BIC:                             8.631
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     15.5000      1.118     13.864      0.046         1.294    29.706
date_delta    -1.5000      0.866     -1.732      0.333       -12.504     9.504
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   3.000
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.531
Skew:                          -0.707   Prob(JB):                        0.767
Kurtosis:                       1.500   Cond. No.                         2.92
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  value   R-squared:                        -inf
Model:                            OLS   Adj. R-squared:                   -inf
Method:                 Least Squares   F-statistic:                    -0.000
Date:                Fri, 06 Jan 2017   Prob (F-statistic):                nan
Time:                        10:48:17   Log-Likelihood:                 63.481
No. Observations:                   2   AIC:                            -123.0
Df Residuals:                       0   BIC:                            -125.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     16.0000        inf          0        nan           nan       nan
date_delta -3.553e-15        inf         -0        nan           nan       nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.400
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.333
Skew:                           0.000   Prob(JB):                        0.846
Kurtosis:                       1.000   Cond. No.                         2.62
==============================================================================
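
The summaries above describe each fit but do not give the prediction the question asked for. A possible sketch, assuming the same formula and a `date_delta` of 8 days for 01-10-2016, calls `predict` on each fitted result. The frame is rebuilt here with `date_delta` already computed, and `new_x`, `preds` and `predictions` are hypothetical names.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Sample data from the question, with dates already converted to day offsets
df = pd.DataFrame({
    "group": list("AAAAAABBBCC"),
    "date_delta": [0, 1, 2, 3, 4, 5, 0, 1, 2, 0, 1],
    "value": [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
})

# 01-10-2016 is 8 days after the earliest date, 01-02-2016
new_x = pd.DataFrame({"date_delta": [8.0]})

preds = {}
for name, g in df.groupby("group"):
    results = ols("value ~ date_delta", data=g).fit()
    # predict() applies the formula to the new frame and returns a Series
    preds[name] = float(results.predict(new_x).iloc[0])

predictions = pd.Series(preds, name="01-10-2016")
print(predictions)
```

Group C has only two observations, so its line fits exactly and leaves no residual degrees of freedom. That is why its summary shows inf and nan statistics even though the prediction itself is well defined.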
