Simple linear regression using pandas dataframe


Question

I want to check trends for a number of entities (SysNr).

I have data spanning 3 years (2014, 2015, 2016).

I'm looking at a large number of variables, but will limit this question to one ('res_f_r').

My DataFrame looks something like this:

import pandas as pd

d = [
    {'RegnskabsAar': 2014, 'SysNr': 1, 'res_f_r': 350000},
    {'RegnskabsAar': 2015, 'SysNr': 1, 'res_f_r': 400000},
    {'RegnskabsAar': 2016, 'SysNr': 1, 'res_f_r': 450000},
    {'RegnskabsAar': 2014, 'SysNr': 2, 'res_f_r': 350000},
    {'RegnskabsAar': 2015, 'SysNr': 2, 'res_f_r': 300000},
    {'RegnskabsAar': 2016, 'SysNr': 2, 'res_f_r': 250000},
]

df = pd.DataFrame(d)



   RegnskabsAar  SysNr  res_f_r
0          2014      1   350000
1          2015      1   400000
2          2016      1   450000
3          2014      2   350000
4          2015      2   300000
5          2016      2   250000

I want to run a linear regression for each entity (SysNr) and get back the slope and intercept.

My expected output is:

   SysNr  intercept  slope
0      1     300000  50000
1      2     400000 -50000

Any ideas?

Recommended answer

I don't know why our intercept values differ (maybe I made a mistake, or the data you posted is not the full data you expect to work with), but I'd suggest using np.polyfit or the tool of your choice (scikit-learn, scipy.stats.linregress, ...) in combination with groupby and apply:

In [25]: df.groupby("SysNr").apply(lambda g: np.polyfit(g.RegnskabsAar, g.res_f_r, 1))
Out[25]:
SysNr
1    [49999.99999999048, -100349999.99998075]
2    [-49999.99999999045, 101049999.99998072]
dtype: object

After that, tidy it up:

In [43]: df.groupby("SysNr").apply(
    ...:     lambda g: np.polyfit(g.RegnskabsAar, g.res_f_r, 1)).apply(
    ...:     pd.Series).rename(columns={0:'slope', 1:'intercept'}).reset_index()
Out[43]:
   SysNr    slope     intercept
0      1  50000.0 -1.003500e+08
1      2 -50000.0  1.010500e+08
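
One possible explanation for the differing intercepts, offered as an assumption rather than something the asker confirmed: the expected values (300000 and 400000) are what you get if the years are counted as 1, 2, 3 instead of 2014, 2015, 2016. The sketch below shifts the year by a hypothetical `BASE_YEAR` of 2013 before fitting. It loops over the groups explicitly instead of using apply, purely to keep the example simple.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "RegnskabsAar": [2014, 2015, 2016, 2014, 2015, 2016],
    "SysNr": [1, 1, 1, 2, 2, 2],
    "res_f_r": [350000, 400000, 450000, 350000, 300000, 250000],
})

# Assumption (not stated in the question): years are counted from 2013,
# so 2014 -> 1, 2015 -> 2, 2016 -> 3.
BASE_YEAR = 2013

rows = []
for sysnr, g in df.groupby("SysNr"):
    # Degree-1 polyfit returns the coefficients as [slope, intercept].
    slope, intercept = np.polyfit(g["RegnskabsAar"] - BASE_YEAR, g["res_f_r"], 1)
    rows.append({"SysNr": sysnr, "intercept": intercept, "slope": slope})

shifted = pd.DataFrame(rows)
print(shifted.round(0))
```

With this shift the slopes are unchanged and the intercepts come out at roughly 300000 and 400000 (up to floating-point error), which matches the expected output in the question. The intercepts of about -1.0035e+08 and 1.0105e+08 above are the same lines evaluated at year 0.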

Because you asked in a comment on the other answer how to handle missing years for some SysNr: just drop the NaNs so that the linear regression is valid. Of course, you could also fill them with the mean or similar, depending on what you want to achieve, but from my point of view that isn't very helpful.
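
As a minimal sketch of dropping the NaNs before fitting: the data below is hypothetical (SysNr 1 is given a missing value for 2015), and the explicit loop with a two-point minimum is my own choice for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical data: SysNr 1 has no value for 2015.
df_gaps = pd.DataFrame({
    "RegnskabsAar": [2014, 2015, 2016, 2014, 2015, 2016],
    "SysNr": [1, 1, 1, 2, 2, 2],
    "res_f_r": [350000, np.nan, 450000, 350000, 300000, 250000],
})

rows = []
# Drop rows with a missing res_f_r first, then fit each entity separately.
for sysnr, g in df_gaps.dropna(subset=["res_f_r"]).groupby("SysNr"):
    if len(g) < 2:
        # A line cannot be fitted through fewer than two points.
        continue
    slope, intercept = np.polyfit(g["RegnskabsAar"], g["res_f_r"], 1)
    rows.append({"SysNr": sysnr, "slope": slope, "intercept": intercept})

fits = pd.DataFrame(rows)
print(fits)
```

Here SysNr 1 is fitted from its two remaining years only, so the slope is still 50000 in this made-up case.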

If an entity has data for only one year, you can't usefully apply a linear regression to it. But you can (if you want to and it fits your case; please provide more information on the data if needed) extrapolate the slope from the other entities to this one and then calculate the intercept. For that, of course, you must make some assumptions about how the slope is distributed across entities (e.g. linear, in which case the slope of SysNr 3 would be -150000.0).
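
A rough sketch of that extrapolation, under the linear assumption mentioned above. The single observation for SysNr 3 (`year_3`, `value_3`) is invented for illustration, and whether a slope can sensibly be modelled as a function of SysNr depends entirely on your data.

```python
import numpy as np

# Slopes obtained from the per-entity fits for SysNr 1 and 2.
known_sysnr = np.array([1, 2])
known_slopes = np.array([50000.0, -50000.0])

# Assumption: the slope changes linearly with SysNr.
a, b = np.polyfit(known_sysnr, known_slopes, 1)
slope_3 = a * 3 + b  # extrapolated slope for SysNr 3

# Hypothetical single observation for SysNr 3 (not from the question).
year_3, value_3 = 2016, 200000.0

# Choose the intercept so the line passes through that one point.
intercept_3 = value_3 - slope_3 * year_3
print(slope_3, intercept_3)
```

The extrapolated slope is about -150000, as in the example above. The intercept is whatever makes the line pass through the one known point.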
