Python pandas linear regression groupby

Views: 219 | Posted: 2018/5/30 13:57:06 | Tags: python pandas dataframe group-by linear-regression


Problem description

 
I am trying to run a linear regression on each group of a grouped pandas DataFrame:

This is the dataframe df:
  group      date      value
    A     01-02-2016     16 
    A     01-03-2016     15 
    A     01-04-2016     14 
    A     01-05-2016     17 
    A     01-06-2016     19 
    A     01-07-2016     20 
    B     01-02-2016     16 
    B     01-03-2016     13 
    B     01-04-2016     13 
    C     01-02-2016     16 
    C     01-03-2016     16 
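
For anyone who wants to reproduce the problem, the table above can be rebuilt as a DataFrame. This is only a sketch: the column names `group`, `date` and `value` come from the question, and the dates are kept as strings exactly as shown.

```python
import pandas as pd

# Rebuild the sample frame from the table above (dates kept as strings for now).
df = pd.DataFrame({
    "group": list("AAAAAABBBCC"),
    "date": ["01-02-2016", "01-03-2016", "01-04-2016", "01-05-2016",
             "01-06-2016", "01-07-2016", "01-02-2016", "01-03-2016",
             "01-04-2016", "01-02-2016", "01-03-2016"],
    "value": [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
})

# Number of rows per group: A has 6, B has 3, C has 2.
group_sizes = df.groupby("group").size()
print(group_sizes)
```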

#import standard packages
import pandas as pd
import numpy as np

#import ML packages
from sklearn.linear_model import LinearRegression

#First, let's group the data by group
df_group = df.groupby('group')

#Then, we need to change the date to integer
df['date'] = pd.to_datetime(df['date'])  
df['date_delta'] = (df['date'] - df['date'].min())  / np.timedelta64(1,'D')
Now I want to predict the value for each group for 01-10-2016. 

I want to get to a new dataframe like this: 
group      01-10-2016
  A      predicted value
  B      predicted value
  C      predicted value
The approach from "How to apply OLS from statsmodels to groupby" doesn't work:
for group in df_group.groups.keys():
      df= df_group.get_group(group)
      X = df['date_delta'] 
      y = df['value']
      model = LinearRegression(y, X)
      results = model.fit(X, y)
      print results.summary()
I get the following error
ValueError: Found arrays with inconsistent numbers of samples: [ 1 52]

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
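
The warning points at the shape of X: scikit-learn estimators expect a 2-D array of shape (n_samples, n_features), while selecting a single column with `df['date_delta']` gives a 1-D vector. The sketch below only illustrates the shapes involved; the small `frame` is made-up illustration data, not the original DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame, just to show the shapes.
frame = pd.DataFrame({"date_delta": [0.0, 1.0, 2.0], "value": [16, 13, 13]})

X_1d = frame["date_delta"].values      # shape (3,): a flat vector
X_2d = frame[["date_delta"]].values    # shape (3, 1): 3 samples, 1 feature
X_reshaped = X_1d.reshape(-1, 1)       # same layout as X_2d

print(X_1d.shape, X_2d.shape, X_reshaped.shape)
```

Either the double-bracket selection or `reshape(-1, 1)` gives the single-feature layout that the warning asks for.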
UPDATE: 

I changed it to 
for group in df_group.groups.keys():
      df= df_group.get_group(group)
      X = df[['date_delta']]
      y = df.value
      model = LinearRegression(y, X)
      results = model.fit(X, y)
      print results.summary()
and now I get this error: 
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
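
This message comes from pandas whenever a whole Series is used where a single True/False is expected. A likely explanation here (an assumption, since the details depend on the scikit-learn version) is that `LinearRegression(y, X)` hands the Series `y` to a constructor option that is later tested as a boolean. The error itself can be reproduced without scikit-learn:

```python
import pandas as pd

y = pd.Series([16, 13, 13])

try:
    # A Series has no single truth value, so this raises ValueError.
    if y:
        pass
except ValueError as err:
    message = str(err)
    print(message)
```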

Solution
New Answer

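def model(df, delta):
    # Fit value ~ date_delta for one group, then predict at `delta`
    # (number of days since the earliest date in the whole frame).
    y = df[['value']].values
    X = df[['date_delta']].values
    # predict expects a 2-D array of shape (n_samples, n_features)
    return np.squeeze(LinearRegression().fit(X, y).predict([[delta]]))

def group_predictions(df, date):
    date = pd.to_datetime(date)
    df.date = pd.to_datetime(df.date)

    # Express every date as a float number of days since the first date
    day = np.timedelta64(1, 'D')
    mn = df.date.min()
    df['date_delta'] = df.date.sub(mn).div(day)

    dd = (date - mn) / day

    return df.groupby('group').apply(model, delta=dd)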
demo  
group_predictions(df, '01-10-2016')

group
A    22.333333333333332
B     3.500000000000007
C                  16.0
dtype: object
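
The result above has dtype object. If plain floats in a one-column table (as sketched in the question) are preferred, one possible variant is shown below. It is not the answer's method: it swaps in `np.polyfit` for the straight-line fit, and the helper name `predict_line` and the explicit month-day-year date format are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": list("AAAAAABBBCC"),
    "date": ["01-02-2016", "01-03-2016", "01-04-2016", "01-05-2016",
             "01-06-2016", "01-07-2016", "01-02-2016", "01-03-2016",
             "01-04-2016", "01-02-2016", "01-03-2016"],
    "value": [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
})

# Assumption: dates are month-day-year, so 01-10-2016 is 10 January 2016.
df["date"] = pd.to_datetime(df["date"], format="%m-%d-%Y")
origin = df["date"].min()
df["date_delta"] = (df["date"] - origin) / pd.Timedelta(days=1)
target = (pd.Timestamp("2016-01-10") - origin) / pd.Timedelta(days=1)

def predict_line(g, x_new):
    # Hypothetical helper: least-squares straight line, evaluated at x_new.
    slope, intercept = np.polyfit(g["date_delta"], g["value"], deg=1)
    return slope * x_new + intercept

preds = pd.Series(
    {name: predict_line(g, target) for name, g in df.groupby("group")},
    name="01-10-2016",
)
result = preds.rename_axis("group").to_frame()
print(result)
```

Iterating over the groups and collecting the results in a dict keeps each prediction a plain float, so the final column is numeric rather than object.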


Old Answer

You're using LinearRegression wrong.


You don't pass the data to the constructor and then fit with the data again. Just instantiate the class like this


model = LinearRegression()

then fit with


model.fit(X, y)
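
Put together, a minimal sketch of the two steps might look like the following. The arrays are group A's values with the dates already converted to day offsets 0 to 5 (an assumption made to keep the example short); `coef_` and `intercept_` are the attributes that `fit` sets on the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Group A: day offsets as a 2-D column, values as a 1-D vector.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([16, 15, 14, 17, 19, 20])

model = LinearRegression()   # no data passed to the constructor
model.fit(X, y)              # the data goes to fit

slope = model.coef_[0]
intercept = model.intercept_
# 01-10-2016 is 8 days after the first date; predict also needs 2-D input.
prediction = model.predict(np.array([[8.0]]))[0]
print(slope, intercept, prediction)
```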



But all that does is set values on the object stored in model. There is no nice summary method. There probably is one somewhere, but I know the one in statsmodels, so see below.



option 1

use statsmodels instead
from statsmodels.formula.api import ols

for k, g in df_group:
    model = ols('value ~ date_delta', g)
    results = model.fit()
    print(results.summary())




                        OLS Regression Results                            
==============================================================================
Dep. Variable:                  value   R-squared:                       0.652
Model:                            OLS   Adj. R-squared:                  0.565
Method:                 Least Squares   F-statistic:                     7.500
Date:                Fri, 06 Jan 2017   Prob (F-statistic):             0.0520
Time:                        10:48:17   Log-Likelihood:                -9.8391
No. Observations:                   6   AIC:                             23.68
Df Residuals:                       4   BIC:                             23.26
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     14.3333      1.106     12.965      0.000        11.264    17.403
date_delta     1.0000      0.365      2.739      0.052        -0.014     2.014
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   1.393
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.461
Skew:                          -0.649   Prob(JB):                        0.794
Kurtosis:                       2.602   Cond. No.                         5.78
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  value   R-squared:                       0.750
Model:                            OLS   Adj. R-squared:                  0.500
Method:                 Least Squares   F-statistic:                     3.000
Date:                Fri, 06 Jan 2017   Prob (F-statistic):              0.333
Time:                        10:48:17   Log-Likelihood:                -3.2171
No. Observations:                   3   AIC:                             10.43
Df Residuals:                       1   BIC:                             8.631
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     15.5000      1.118     13.864      0.046         1.294    29.706
date_delta    -1.5000      0.866     -1.732      0.333       -12.504     9.504
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   3.000
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.531
Skew:                          -0.707   Prob(JB):                        0.767
Kurtosis:                       1.500   Cond. No.                         2.92
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  value   R-squared:                        -inf
Model:                            OLS   Adj. R-squared:                   -inf
Method:                 Least Squares   F-statistic:                    -0.000
Date:                Fri, 06 Jan 2017   Prob (F-statistic):                nan
Time:                        10:48:17   Log-Likelihood:                 63.481
No. Observations:                   2   AIC:                            -123.0
Df Residuals:                       0   BIC:                            -125.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     16.0000        inf          0        nan           nan       nan
date_delta -3.553e-15        inf         -0        nan           nan       nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.400
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.333
Skew:                           0.000   Prob(JB):                        0.846
Kurtosis:                       1.000   Cond. No.                         2.62
==============================================================================
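
If only the fitted coefficients and a prediction per group are needed, rather than the full summaries, the same loop could collect them into a small table. This is a sketch under stated assumptions: the day offsets are typed in directly instead of being derived from the dates, and the names `rows` and `table` are made up for the illustration.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Sample data with dates already expressed as days since 01-02-2016.
df = pd.DataFrame({
    "group": list("AAAAAABBBCC"),
    "date_delta": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 0.0, 1.0, 2.0, 0.0, 1.0],
    "value": [16, 15, 14, 17, 19, 20, 16, 13, 13, 16, 16],
})

# 01-10-2016 is 8 days after the first date.
new_point = pd.DataFrame({"date_delta": [8.0]})

rows = {}
for name, g in df.groupby("group"):
    results = ols("value ~ date_delta", data=g).fit()
    rows[name] = {
        "Intercept": results.params["Intercept"],
        "slope": results.params["date_delta"],
        "prediction": results.predict(new_point).iloc[0],
    }

table = pd.DataFrame.from_dict(rows, orient="index")
print(table)
```

Group C has only two observations, so its line fits exactly and leaves no residual degrees of freedom, which is why its summary above shows inf and nan for the standard errors.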


                        