Run regression analysis on multiple subsets of pandas columns efficiently


Question

I could have asked a shorter question that focuses only on the core problem here, which is list permutations. The reason I'm bringing statsmodels and pandas into the question is that there may be specific tools for step-wise regression that offer the same flexibility for storing the desired regression output, as I'm about to show below, but are much more efficient. At least I hope so.

Given a dataframe like the following:

Snippet 1:

# Imports
import pandas as pd
import numpy as np
import itertools
import statsmodels.api as sm

# A dataframe with random numbers
np.random.seed(123)
rows = 12
listVars= ['y','x1', 'x2', 'x3']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars) 
df_1 = df_1.set_index(rng)

print(df_1)

[Screenshot 1: output of Snippet 1, image not available]

I'd like to run several regression analyses on the dependent variable y using multiple combinations of the independent variables x1, x2 and x3. In other words, this is a step-wise regression analysis where y is tested against x1, and then against x2 and x3 consecutively. Then y is tested against the set of x1 AND x2, and so on, like this:

  1. ['y',['x1']]
  2. ['y',['x2']]
  3. ['y',['x3']]
  4. ['y',['x1','x2']]
  5. ['y',['x1','x2','x3']]

My inefficient approach:

In the first two snippets below, I'm able to do exactly this by hardcoding the execution sequence as a list of lists.

Here are the subsets of listVars:

Snippet 2:

listExec = [[listVars[0], listVars[1:2]],
       [listVars[0], listVars[2:3]],
       [listVars[0], listVars[3:4]],
       [listVars[0], listVars[1:3]],
       [listVars[0], listVars[1:4]]]

for l in listExec:
    print(l)

[Screenshot 2: output of Snippet 2, image not available]

With listExec I can set up a procedure for regression analysis and store a bunch of results (rsquared, or the entire model output from model.summary()) in a list like this:

Snippet 3:

allResults = []
for l in listExec:
    # l[0] is the dependent variable, l[1] is the list of regressors
    x = sm.add_constant(df_1[l[1]])
    model = sm.OLS(df_1[l[0]], x).fit()
    result = model.rsquared
    allResults.append(result)

print(allResults)

[Screenshot 3: output of Snippet 3, image not available]

And this is pretty awesome, but horribly inefficient for longer lists of variables, since every subset has to be written out by hand.
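To put a rough number on why a hand-written listExec stops scaling, here is a minimal sketch that counts the non-empty subsets of n predictors. The helper name count_subsets is made up for illustration and is not part of the question.

```python
import math

def count_subsets(n_predictors):
    """Number of non-empty subsets of n predictors: sum of C(n, r) for r = 1..n."""
    return sum(math.comb(n_predictors, r) for r in range(1, n_predictors + 1))

for n in (3, 10, 20):
    print(n, count_subsets(n))
# 3 7
# 10 1023
# 20 1048575
```

The count is 2**n - 1, so even a moderately wide dataframe calls for generating the subsets programmatically.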

My attempt at list combinations:

Following the suggestions from "How to generate all permutations of a list in Python" and "Convert a list of tuples to a list of lists", I'm able to set up all orderings of ALL variables like this:

Snippet 4:

allTuples = list(itertools.permutations(listVars))
allCombos = [list(elem) for elem in allTuples]

[Screenshot 4: output of Snippet 4, image not available]

And that's a lot of fun, but it does not give me the stepwise approach that I'm after. Anyway, I hope some of you find this interesting.
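The gap between Snippet 4 and the desired output comes down to permutations versus combinations. The sketch below is a small illustration, using only the predictor names rather than the full listVars. itertools.permutations reorders one fixed set of variables, while itertools.combinations with a varying size r yields the distinct subsets.

```python
import itertools

# Predictor names only (the question's listVars without 'y')
predictors = ['x1', 'x2', 'x3']

# permutations: every ordering of the full list; all describe the same regressor set
perms = list(itertools.permutations(predictors))

# combinations: every unordered subset of each size r = 1..n
subsets = [list(c)
           for r in range(1, len(predictors) + 1)
           for c in itertools.combinations(predictors, r)]

print(len(perms))    # 6
print(len(subsets))  # 7
print(subsets)
```

Column order does not change an OLS fit, so the six permutations would all give the same regression, whereas the seven subsets are seven different models.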

Thank you for any suggestions!

Recommended answer

Based on the help I got here, I've been able to put together a function that takes all columns in a pandas dataframe, defines a dependent variable, and returns all unique combinations of the remaining variables. The result differs a bit from the desired result as defined above, but I think it makes more sense for practical use. I'm still hoping that others will be able to post even better solutions.

Here it is:

# Imports
import pandas as pd
import numpy as np
import itertools

# A dataframe with random numbers
np.random.seed(123)
rows = 12
listVars= ['y','x1', 'x2', 'x3']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100,150,size=(rows, len(listVars))), columns=listVars) 
df_1 = df_1.set_index(rng)

# The function
def StepWise(columns, dependent):
    """ Takes the columns of a pandas dataframe, defines a dependent variable
        and returns all unique combinations of the remaining (independent) variables.

    """

    independent = columns.copy()
    independent.remove(dependent)

    combos = []
    for i in range(1, len(independent) + 1):
        # All subsets of size i, in the order the columns appear
        combos.extend(itertools.combinations(independent, i))

    combosIndependent = [list(elem) for elem in combos]
    combosAll = [[dependent, other] for other in combosIndependent]
    return combosAll

lExec = StepWise(columns = list(df_1), dependent = 'y')
print(lExec)

If you combine this with Snippet 3 above, you can easily store the results of multiple regression analyses for a specified dependent variable in a pandas dataframe.
