Python pandas: how to run multiple univariate regressions by group

Question

Suppose I have a DataFrame with one column of y variable and many columns of x variables. I would like to be able to run multiple univariate regressions of y vs x1, y vs x2, ..., etc, and store the predictions back into the DataFrame. Also I need to do this by a group variable.

import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
  'y': np.random.randn(20),
  'x1': np.random.randn(20), 
  'x2': np.random.randn(20),
  'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return sm.OLS(y, x).fit().predict()

df.groupby('grp').apply(ols_res) # This does not work

The code above obviously does not work. It is not clear to me how to pass the fixed y to the function correctly while having apply iterate over the x columns (x1, x2, ...). I suspect there might be a very clever one-line solution to do this. Any ideas?

Recommended answer

The function you pass to apply must take a pandas.DataFrame as its first argument; apply calls it once per group with that group's rows. Any additional positional or keyword arguments you give to apply are forwarded to the applied function. So your example will work with a small modification. Change ols_res to:

def ols_res(df, xcols, ycol):
    return sm.OLS(df[ycol], df[xcols]).fit().predict()

Then you can use groupby and apply like this, passing the extra arguments either by keyword or by position:

df.groupby('grp').apply(ols_res, xcols=['x1', 'x2'], ycol='y')

df.groupby('grp').apply(ols_res, ['x1', 'x2'], 'y')
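
To see the argument forwarding in isolation, here is a minimal sketch that leaves statsmodels out. The helper group_range, its scale parameter, and the demo frame are hypothetical names invented for this illustration; only groupby and apply come from the answer above.

```python
import pandas as pd

# Small, hand-checkable frame (illustrative data, not from the question)
demo = pd.DataFrame({
    'grp': ['a', 'b', 'a', 'b'],
    'y': [1.0, 10.0, 4.0, 30.0],
})

def group_range(group, col, scale=1.0):
    """Hypothetical helper: scaled range of one column within a group."""
    return (group[col].max() - group[col].min()) * scale

# Extra arguments after the function are handed on to group_range,
# once per group, after the group's sub-frame.
by_keyword = demo.groupby('grp').apply(group_range, col='y', scale=2.0)
by_position = demo.groupby('grp').apply(group_range, 'y', 2.0)
```

Both calls return a Series indexed by grp. Recent pandas releases may emit a deprecation warning about grouping columns being passed to the applied function; that does not affect this sketch.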

Edit

The above code does not run multiple univariate regressions. Instead, it runs one multivariate regression per group, with y regressed on all of the x columns at once. With another slight modification, however, it does:

def ols_res(df, xcols, ycol):
    # One regression per x column, each using only that column as the regressor
    return pd.DataFrame({xcol: sm.OLS(df[ycol], df[xcol]).fit().predict() for xcol in xcols})
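
The question also asks for the predictions to be stored back in the DataFrame. The following is a rough sketch of one way to do that, not the answer's own method. It assumes the same model as sm.OLS(y, x) without add_constant, that is, a univariate regression with no intercept, and computes the fitted values in closed form with pandas alone instead of calling statsmodels. The helper fitted_no_intercept and the pred_ column prefix are hypothetical names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'y': rng.standard_normal(20),
    'x1': rng.standard_normal(20),
    'x2': rng.standard_normal(20),
    'grp': ['a', 'b'] * 10,
})

def fitted_no_intercept(x, y):
    """Fitted values of a no-intercept OLS of y on a single regressor x."""
    beta = (x * y).sum() / (x * x).sum()  # closed-form slope
    return x * beta                       # keeps x's original index

for xcol in ['x1', 'x2']:
    # Fit each group separately, then glue the pieces back together.
    parts = [fitted_no_intercept(g[xcol], g['y']) for _, g in df.groupby('grp')]
    # Column assignment aligns on the index, so every prediction
    # lands on the row it was computed from.
    df['pred_' + xcol] = pd.concat(parts)
```

Because each piece keeps the original row labels, no reordering is needed after the groups are recombined. If an intercept is wanted, the statsmodels route would need sm.add_constant on the regressor, and the closed form above would no longer apply.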

Edit 2

Although the above solution works, I think the following is a little more pandas-y:

import statsmodels.api as sm
import pandas as pd
import numpy as np

df = pd.DataFrame({
  'y': np.random.randn(20),
  'x1': np.random.randn(20), 
  'x2': np.random.randn(20),
  'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return pd.Series(sm.OLS(y, x).fit().predict())

df.groupby('grp').apply(lambda g: g[['x1', 'x2']].apply(ols_res, y=g['y']))

For some reason, if I define ols_res() as it was originally (returning the bare predict() result instead of wrapping it in a pd.Series), the resulting DataFrame doesn't have the group label in the index.
