python OLS statsmodels T Stats of variables not entered into the model

Question

Hi, I have created an OLS regression using Statsmodels.

I've written some code that loops through every variable in a dataframe, enters it into the model, records the T stat in a new dataframe, and builds a list of potential variables.

However, I have 20,000 variables, so it takes ages to run each time.

Can anyone think of a better way of doing this?

Here is my current approach:

import pandas as pd
import statsmodels.formula.api as smf

TStatsOut=pd.DataFrame()

for i in VarsOut:
    try:
        # fit the base terms plus one candidate variable
        xstrout='+'.join([baseterms,i])
        fout='ymod~'+xstrout
        modout = smf.ols(fout, data=df_train).fit()
        # record the candidate's p-value and coefficient as a one-row dataframe
        j=pd.DataFrame(modout.pvalues,index=[i],columns=['PValue'])
        k=pd.DataFrame(modout.params,index=[i],columns=['Coeff'])
        s=pd.concat([j, k], axis=1, join_axes=[j.index])
        TStatsOut=TStatsOut.append(s)
    except Exception:
        continue

Answer

Here is what I have found in regards to your question. My answer uses dask for distributed computing, along with a general clean-up of your current approach.

I made a smaller fake dataset with 1,000 variables; one will be the outcome and two will be the baseterms, so there are really 997 variables to loop through.

import dask
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

#make some toy data for the case you showed
df_train = pd.DataFrame(np.random.randint(low=0,high=10,size=(10000, 1000)))
df_train.columns = ['var'+str(x) for x in df_train.columns]
baseterms = 'var1+var2'
VarsOut = df_train.columns[3:]
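
As a quick sanity check (this snippet is not part of the original answer), you can confirm the candidate count and preview the formula string the loop will build:

print(len(VarsOut))                                  # 997 candidate variables
print('var0~' + '+'.join([baseterms, VarsOut[0]]))   # var0~var1+var2+var3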

Baseline for your current code (20 s ± 858 ms):

%%timeit
TStatsOut=pd.DataFrame()

for i in VarsOut:
    xstrout='+'.join([baseterms,i])
    fout='var0~'+xstrout
    modout = smf.ols(fout, data=df_train).fit()
    j=pd.DataFrame(modout.pvalues,index=[i],columns=['PValue'])
    k=pd.DataFrame(modout.params,index=[i],columns=['Coeff'])
    s=pd.concat([j, k], axis=1)
    s=s.reindex(j.index)
    TStatsOut=TStatsOut.append(s)

I created a function for readability; it returns just the p-value and regression coefficient for each variable tested, instead of one-row dataframes.

def testVar(i):
    xstrout='+'.join([baseterms,i])
    fout='var0~'+xstrout
    modout = smf.ols(fout, data=df_train).fit()
    pval=modout.pvalues[i]
    coef=modout.params[i]
    return pval, coef

Now runs at 14.1 s ± 982 ms:

%%timeit
pvals=[]
coefs=[]

for i in VarsOut:
    pval, coef = testVar(i)
    pvals.append(pval)
    coefs.append(coef)

TStatsOut = pd.DataFrame(data={'PValue':pvals, 'Coeff':coefs},
                         index=VarsOut)[['PValue','Coeff']]
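
From here, a minimal sketch of how you might turn TStatsOut into the list of potential variables mentioned in the question; the 0.05 cutoff is an assumption, so substitute whatever screening threshold you actually use:

candidates = (TStatsOut[TStatsOut['PValue'] < 0.05]   # assumed significance cutoff
              .sort_values('PValue')                  # most significant first
              .index
              .tolist())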

Using Dask delayed for parallel processing. Keep in mind that each delayed task created also incurs a slight overhead, so it may not always be beneficial; it will depend on your exact dataset and how long the regressions take. My data example may be too simple to show any benefit.

#define the same function as before, but tell dask how many outputs it has
@dask.delayed(nout=2)
def testVar(i):
    xstrout='+'.join([baseterms,i])
    fout='var0~'+xstrout
    modout = smf.ols(fout, data=df_train).fit()
    pval=modout.pvalues[i]
    coef=modout.params[i]
    return pval, coef

Now run through the 997 candidate variables and create the same dataframe with dask delayed (18.6 s ± 588 ms):

%%timeit
pvals=[]
coefs=[]

for i in VarsOut:
    # builds the task graph lazily; nothing is computed until dask.compute below
    pval, coef = dask.delayed(testVar)(i)
    pvals.append(pval)
    coefs.append(coef)

pvals, coefs = dask.compute(pvals,coefs)    
TStatsOut = pd.DataFrame(data={'PValue':pvals, 'Coeff':coefs},
                         index=VarsOut)[['PValue','Coeff']]

Again, dask delayed creates more overhead as it builds the tasks to be sent across many processors, so any performance gain will depend on how much time your data actually takes in the regressions, as well as how many CPUs you have available. Dask can be scaled from a single workstation to a cluster of workstations.
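
If you want to go beyond the default scheduler, a minimal sketch using dask.distributed is shown below (it requires the distributed package); the worker and thread counts are assumptions you would tune for your machine or cluster:

from dask.distributed import Client

# start a local cluster of worker processes; pass a scheduler address instead to scale out
client = Client(n_workers=4, threads_per_worker=1)

# with a client active, dask.compute dispatches the delayed tasks to its workers
pvals, coefs = dask.compute(pvals, coefs)

client.close()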
