python OLS statsmodels T Stats of variables not entered into the model
Problem description
Hi, I have created an OLS regression using statsmodels.
I've written some code that loops through every variable in a dataframe, enters it into the model, records the T stat in a new dataframe, and builds a list of potential variables.
However, I have 20,000 variables, so it takes ages to run each time.
Can anyone think of a better way? This is my current approach:
TStatsOut = pd.DataFrame()
for i in VarsOut:
    try:
        xstrout = '+'.join([baseterms, i])
        fout = 'ymod~' + xstrout
        modout = smf.ols(fout, data=df_train).fit()
        j = pd.DataFrame(modout.pvalues, index=[i], columns=['PValue'])
        k = pd.DataFrame(modout.params, index=[i], columns=['Coeff'])
        s = pd.concat([j, k], axis=1, join_axes=[j.index])
        TStatsOut = TStatsOut.append(s)
    except Exception:
        # skip variables whose model fails to fit
        continue
Recommended answer
Here is what I have found regarding your question. My answer uses dask for distributed computing, along with a general cleanup of your current approach.
I made a smaller fake dataset with 1,000 variables; one is the outcome and two are the baseterms, so there are really 997 variables to loop through.
import dask
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# make some toy data for the case you showed
df_train = pd.DataFrame(np.random.randint(low=0, high=10, size=(10000, 1000)))
df_train.columns = ['var' + str(x) for x in df_train.columns]

baseterms = 'var1+var2'
VarsOut = df_train.columns[3:]
Baseline for your current code (20 s ± 858 ms):
%%timeit
TStatsOut = pd.DataFrame()
for i in VarsOut:
    xstrout = '+'.join([baseterms, i])
    fout = 'var0~' + xstrout
    modout = smf.ols(fout, data=df_train).fit()
    j = pd.DataFrame(modout.pvalues, index=[i], columns=['PValue'])
    k = pd.DataFrame(modout.params, index=[i], columns=['Coeff'])
    s = pd.concat([j, k], axis=1)
    s = s.reindex(j.index)
    TStatsOut = TStatsOut.append(s)
I created a function for readability; it returns just the p-value and regression coefficient for each tested variable, instead of the one-line dataframes.
def testVar(i):
    xstrout = '+'.join([baseterms, i])
    fout = 'var0~' + xstrout
    modout = smf.ols(fout, data=df_train).fit()
    pval = modout.pvalues[i]
    coef = modout.params[i]
    return pval, coef
This now runs at 14.1 s ± 982 ms:
%%timeit
pvals = []
coefs = []
for i in VarsOut:
    pval, coef = testVar(i)
    pvals.append(pval)
    coefs.append(coef)
TStatsOut = pd.DataFrame(data={'PValue': pvals, 'Coeff': coefs},
                         index=VarsOut)[['PValue', 'Coeff']]
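The question mentions building a list of potential variables from these results. Assuming a significance cutoff such as 0.05 (an illustrative threshold, not something the original post specifies), that list can be pulled straight from the resulting dataframe. The toy values below stand in for a real TStatsOut:

```python
import pandas as pd

# toy results standing in for TStatsOut (values made up for illustration)
TStatsOut = pd.DataFrame(
    {'PValue': [0.001, 0.30, 0.04, 0.75],
     'Coeff':  [1.2,   0.1, -0.8, 0.02]},
    index=['var3', 'var4', 'var5', 'var6'])

# keep only variables whose p-value clears the cutoff, most significant first
candidates = (TStatsOut[TStatsOut['PValue'] < 0.05]
              .sort_values('PValue')
              .index.tolist())
print(candidates)  # ['var3', 'var5']
```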
Using dask delayed for parallel processing. Keep in mind that every delayed task created adds a slight overhead, so sometimes it may not be beneficial; it will depend on your exact dataset and how long the regressions take. My data example may be too simple to show any benefit.
# define the same function as before, but tell dask how many outputs it has
@dask.delayed(nout=2)
def testVar(i):
    xstrout = '+'.join([baseterms, i])
    fout = 'var0~' + xstrout
    modout = smf.ols(fout, data=df_train).fit()
    pval = modout.pvalues[i]
    coef = modout.params[i]
    return pval, coef
Now run through the 997 candidate variables and build the same dataframe with dask delayed (18.6 s ± 588 ms):
%%timeit
pvals = []
coefs = []
for i in VarsOut:
    # testVar is already decorated with dask.delayed, so calling it
    # directly builds the task graph; wrapping it in dask.delayed again
    # would lose nout=2 and break the tuple unpacking
    pval, coef = testVar(i)
    pvals.append(pval)
    coefs.append(coef)
pvals, coefs = dask.compute(pvals, coefs)
TStatsOut = pd.DataFrame(data={'PValue': pvals, 'Coeff': coefs},
                         index=VarsOut)[['PValue', 'Coeff']]
Again, dask delayed creates more overhead as it builds the tasks to be sent across many processors, so any performance gain will depend on how long your data actually takes in the regressions, as well as how many CPUs you have available. Dask can scale from a single workstation to a cluster of workstations.