Python 2.7 - statsmodels - formatting and writing summary output


Problem description

I'm doing logistic regression on Mac OS X Lion, using pandas 0.11.0 for data handling and statsmodels 0.4.3 for the actual regression.

I'm going to be running ~2,900 different logistic regression models and need the results written to a csv file and formatted in a particular way.

Currently, I'm only aware of print result.summary(), which prints the results to the shell as follows:

 Logit Regression Results                           
  ==============================================================================
 Dep. Variable:            death_death   No. Observations:                 9752
 Model:                          Logit   Df Residuals:                     9747
 Method:                           MLE   Df Model:                            4
 Date:                Wed, 22 May 2013   Pseudo R-squ.:                -0.02672
 Time:                        22:15:05   Log-Likelihood:                -5806.9
 converged:                       True   LL-Null:                       -5655.8
                                         LLR p-value:                     1.000
 ===============================================================================
                   coef    std err          z      P>|z|      [95.0% Conf. Int.]
 -------------------------------------------------------------------------------
 age_age5064    -0.1999      0.055     -3.619      0.000        -0.308    -0.092
 age_age6574    -0.2553      0.053     -4.847      0.000        -0.359    -0.152
 sex_female     -0.2515      0.044     -5.765      0.000        -0.337    -0.166
 stage_early    -0.1838      0.041     -4.528      0.000        -0.263    -0.104
 access         -0.0102      0.001    -16.381      0.000        -0.011    -0.009
 ===============================================================================

I will also need the odds ratios, which are computed by print np.exp(result.params) and printed in the shell like this:

age_age5064    0.818842
age_age6574    0.774648
sex_female     0.777667
stage_early    0.832098
access         0.989859
dtype: float64

What I need is for each of these to be written to a csv file as one very long row like the following (I'm not sure, at this point, whether I will need things like Log-Likelihood, but have included it for the sake of thoroughness):

`Log-Likelihood, age_age5064_coef, age_age5064_std_err, age_age5064_z, age_age5064_p>|z|,...age_age6574_coef, age_age6574_std_err, ......access_coef, access_std_err, ....age_age5064_odds_ratio, age_age6574_odds_ratio, ...sex_female_odds_ratio,.....access_odds_ratio`

I think you get the picture: a very long row with all of the actual values, and a header with all the column names in a similar format.

I am familiar with the csv module in Python, and am becoming more familiar with pandas. I'm not sure whether this information could be formatted and stored in a pandas DataFrame and then written to a file with to_csv once all ~2,900 logistic regression models have completed; that would certainly be fine. Writing each model's results as it completes (using the csv module) is also fine.

Update:

So, I was looking more at the statsmodels site, specifically trying to figure out how the results of a model are stored within classes. It looks like there is a class called 'Results', which will need to be used. I think the way to go might be to inherit from this class and create another class in which some of the methods/operators are changed, in order to get the formatting I require. I have very little experience doing this, and will need to spend quite a bit of time figuring it out (which is fine). If anybody can help or has more experience, that would be awesome!

Here is the site where the classes are laid out: statsmodels results class

Recommended answer

There is currently no premade table of the parameters and their result statistics.

Essentially, you need to stack all the results yourself. Whether you use a list, a numpy array or a pandas DataFrame depends on what is more convenient for you.

For example, if I want one numpy array that holds the results for a model (llf plus the values from the summary parameter table), then I could use

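import numpy as np

res_all = []
for res in results:  # results: an iterable of fitted statsmodels results
    # conf_int() has one row per parameter; transpose to unpack the two bounds
    low, upp = np.asarray(res.conf_int()).T
    res_all.append(np.concatenate(([res.llf], res.params, res.tvalues,
                                   res.pvalues, low, upp)))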

But it might be better to align the results with pandas, depending on how the structure varies across models.
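As a rough illustration of what aligning with pandas could look like, the sketch below builds one row per model from coefficient Series that do not share the same regressors. The model names (model_a, model_b) and the two stand-in Series are hypothetical; their values are borrowed from the summary output in the question purely for illustration. With real models, res.params would take their place, assuming the models were fitted on pandas data so that the parameters carry names.

```python
import numpy as np
import pandas as pd

# Stand-in coefficient vectors for two models with different regressors
# (hypothetical; with fitted models these would be res.params).
params_a = pd.Series({"age_age5064": -0.1999, "access": -0.0102})
params_b = pd.Series({"sex_female": -0.2515, "access": -0.0102})

# A dict of Series is aligned on the union of parameter names;
# transposing gives one row per model, with NaN where a model lacks a term.
table = pd.DataFrame({"model_a": params_a, "model_b": params_b}).T

# Odds ratios are the exponentiated coefficients, as in the question.
odds = np.exp(table).add_suffix("_odds_ratio")
combined = table.add_suffix("_coef").join(odds)

# With no path, to_csv() returns the csv text; pass a file name to write a file.
csv_text = combined.to_csv()
```

Missing terms show up as empty cells in the csv, so every model ends up under the same header even when the regressors differ.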

You could write a helper function that takes all the results from a results instance and concatenates them into a row.

(I'm not sure what the most convenient way of writing to csv by rows is.)
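One possible shape for such a helper, paired with the standard library's csv.DictWriter for writing row by row, is sketched below. The function name flatten_result and the column naming are assumptions modelled on the header the question asks for. The sketch also assumes that params, bse, tvalues and pvalues are indexable by parameter name, which holds when the model was fitted on pandas data. A SimpleNamespace stands in for a fitted result so the example runs without statsmodels, and it is written for Python 3.

```python
import csv
import io
import math
from collections import OrderedDict
from types import SimpleNamespace


def flatten_result(res):
    """Flatten one fitted result into an ordered {column_name: value} row."""
    row = OrderedDict()
    row["Log-Likelihood"] = res.llf
    stats = [("coef", res.params), ("std_err", res.bse),
             ("z", res.tvalues), ("p>|z|", res.pvalues)]
    names = list(res.params.keys())
    # All statistics for the first parameter, then the next, and so on.
    for name in names:
        for label, values in stats:
            row["{}_{}".format(name, label)] = values[name]
    # Odds ratios go at the end of the row, as in the requested layout.
    for name in names:
        row["{}_odds_ratio".format(name)] = math.exp(res.params[name])
    return row


# Hypothetical stand-in for a fitted Logit result (values from the question).
fake_result = SimpleNamespace(
    llf=-5806.9,
    params={"sex_female": -0.2515, "access": -0.0102},
    bse={"sex_female": 0.044, "access": 0.001},
    tvalues={"sex_female": -5.765, "access": -16.381},
    pvalues={"sex_female": 0.0, "access": 0.0},
)

rows = [flatten_result(fake_result)]  # in practice, one row per fitted model

# An in-memory buffer keeps the sketch side-effect free; a real run would
# use open("results.csv", "w", newline="") instead (file name hypothetical).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Because each model becomes a plain dict, the rows can either be written as each model finishes or collected in a list and passed to pandas.DataFrame once all models have run.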

Here is an example that stores the regression results in a DataFrame:

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/sandbox/multilinear.py#L21

The loop is on line 159.

summary() and similar code outside of statsmodels (for example http://johnbeieler.org/py_apsrtable/, which combines several results) are oriented towards printing, not towards storing the values.
