Python pandas: how to run multiple univariate regressions by group
Problem description
Suppose I have a DataFrame with one column holding a y variable and many columns holding x variables. I would like to run multiple univariate regressions (y vs. x1, y vs. x2, and so on) and store the predictions back in the DataFrame. I also need to do this separately for each level of a group variable.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    'y': np.random.randn(20),
    'x1': np.random.randn(20),
    'x2': np.random.randn(20),
    'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return sm.OLS(y, x).fit().predict()

df.groupby('grp').apply(ols_res)  # This does not work
The code above obviously does not work. It is not clear to me how to pass the fixed y to the function while apply iterates over the x columns (x1, x2, ...). I suspect there is a clever one-line solution for this. Any ideas?
Recommended answer
The function you pass to apply must take a pandas.DataFrame as its first argument. Any additional positional or keyword arguments you give to apply are forwarded to that function. So your example works with a small modification. Change ols_res to
def ols_res(df, xcols, ycol):
    return sm.OLS(df[ycol], df[xcols]).fit().predict()
Then you can use groupby and apply like this
df.groupby('grp').apply(ols_res, xcols=['x1', 'x2'], ycol='y')
or
df.groupby('grp').apply(ols_res, ['x1', 'x2'], 'y')
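The argument forwarding described above can be checked on a small example that does not need statsmodels. This is only a sketch: the helper slope_through_origin, the demo frame, and its values are made up for illustration, and the closed-form slope sum(x*y) / sum(x*x) stands in for a no-intercept OLS fit. The value columns are selected before apply so the function never sees the grouping column.

```python
import numpy as np
import pandas as pd

def slope_through_origin(group, xcols, ycol):
    """Hypothetical helper: no-intercept OLS slope of ycol on each column in xcols."""
    y = group[ycol]
    return pd.Series({xcol: (group[xcol] * y).sum() / (group[xcol] ** 2).sum()
                      for xcol in xcols})

# Made-up data: y is an exact multiple of each x within every group
demo = pd.DataFrame({
    'y':   [2.0, 4.0, 6.0, 3.0, 6.0, 9.0],
    'x1':  [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    'x2':  [2.0, 4.0, 6.0, 6.0, 12.0, 18.0],
    'grp': ['a', 'a', 'a', 'b', 'b', 'b'],
})

# xcols and ycol are forwarded by apply to slope_through_origin as keyword arguments
slopes = (demo.groupby('grp')[['y', 'x1', 'x2']]
              .apply(slope_through_origin, xcols=['x1', 'x2'], ycol='y'))
print(slopes)
```

The result has one row per group and one column per x variable, which shows that each group was passed to the function together with the extra arguments.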
Edit
The code above does not run multiple univariate regressions. Instead, it runs one multivariate regression per group, with y regressed on all the x columns at once. With (another) slight modification it will, however.
def ols_res(df, xcols, ycol):
    # One fit per x column; each output column holds that fit's in-sample predictions
    return pd.DataFrame({xcol: sm.OLS(df[ycol], df[xcol]).fit().predict() for xcol in xcols})
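To see why one joint fit and several single-variable fits are not interchangeable, here is a small numeric sketch. It uses numpy's least-squares solver in place of statsmodels, fits without an intercept (sm.OLS does not add a constant on its own), and the numbers are made up for illustration.

```python
import numpy as np

# Made-up data: y is an exact linear combination of two correlated regressors
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 1.0, 2.0, 2.0])
y = 2.0 * x1 + 3.0 * x2

# One joint regression of y on [x1, x2] (no intercept)
joint, *_ = np.linalg.lstsq(np.column_stack([x1, x2]), y, rcond=None)

# Two separate univariate regressions (no intercept): slope = x.y / x.x
separate = np.array([x @ y / (x @ x) for x in (x1, x2)])

print(joint)     # coefficients from the joint fit
print(separate)  # slopes from the two single-variable fits
```

The joint fit recovers the coefficients 2 and 3, while the separate fits give 3.7 and 6.4, because each single-variable slope also absorbs the effect of the omitted, correlated regressor.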
Edit 2
Although the solution above works, I think the following is a little more pandas-y
import statsmodels.api as sm
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'y': np.random.randn(20),
    'x1': np.random.randn(20),
    'x2': np.random.randn(20),
    'grp': ['a', 'b'] * 10})

def ols_res(x, y):
    return pd.Series(sm.OLS(y, x).fit().predict())

df.groupby('grp').apply(lambda x: x[['x1', 'x2']].apply(ols_res, y=x['y']))
For some reason, if I define ols_res() as it was originally (returning the bare result of predict() instead of wrapping it in a pd.Series), the resulting DataFrame doesn't have the group label in the index.
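The question also asks for the predictions to be stored back in the DataFrame. One way to keep them aligned with the original rows is to skip apply and use groupby transforms. The sketch below rests on assumptions: it replaces statsmodels with the closed-form no-intercept slope sum(x*y) / sum(x*x), which corresponds to sm.OLS(y, x) with no constant added, and the output column names pred_x1 and pred_x2 are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({
    'y': rng.standard_normal(20),
    'x1': rng.standard_normal(20),
    'x2': rng.standard_normal(20),
    'grp': ['a', 'b'] * 10,
})

xcols = ['x1', 'x2']
for xcol in xcols:
    # Per-group sums, broadcast back to every row of the group
    xy = (data[xcol] * data['y']).groupby(data['grp']).transform('sum')
    xx = (data[xcol] ** 2).groupby(data['grp']).transform('sum')
    # Prediction of the no-intercept univariate fit: x * (sum(xy) / sum(xx))
    data['pred_' + xcol] = data[xcol] * xy / xx

print(data.head())
```

Because transform returns a result with the same index as the input, each prediction lands on the row it was computed from, and no re-indexing is needed afterwards. If an intercept is wanted, the slope formula above no longer applies and the fit would need a constant term.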