How to loop a logistic regression n number of times?


Problem description

I got a piece of code on SAS that predicts consumer behavior. So far I have done 50 samples and 50 logistic regressions by hand, but I'd like to automate this process. The steps are as follows:

  • Create a table with all the clients whose value is "1"
  • Create a table with all the clients whose value is "0"
  • (the code below starts here) Start a loop:
    • Take a 3000-person sample from the clients with value "1"
    • Take a 3000-person sample from the clients with value "0"
    • Append both tables
    • Run a logistic regression that should output the ROC value and the maximum likelihood estimates

You'll find below a piece of the code. Can you advise me on how to loop this logistic regression 50 times, please? So far I can't make it work... I'm a beginner in SQL.

    %macro RunReg (DSName, NumVars) ;
    %do i=1 %to &NumVars;
    
    /* Draw a 3000-person sample from TOP_1 into ALEA_1 */
    PROC SURVEYSELECT DATA= TOP_1
        OUT= ALEA_1
        METHOD=SRS
        N=3000;
    QUIT;
    
    
    /* Draw a 3000-person sample from TOP_0 into ALEA_0 */
    PROC SURVEYSELECT DATA= TOP_0
        OUT= ALEA_0
        METHOD=SRS
        N=3000;
    QUIT;
    
    
    /*Append both tables */
    PROC SQL;
        CREATE TABLE BOTH_SAMPLES As
        SELECT * FROM TOP_1
          OUTER UNION CORR
        SELECT * FROM TOP_0;
    QUIT;
    
    
    /* Logistic regression*/
    DATA WORK.&DSName noprint
            Outset=PE(rename(x&i=Value));
        Model Y = x&I;
        SET WORK.APPEND_TABLE(IN=__ORIG) WORK.BASE_PREDICT_2;
        __FLAG=__ORIG;
        __DEP=TOP_CREDIT_HABITAT_2017;
        if not __FLAG then TOP_CREDIT_HABITAT_2017=.;
    RUN;
    
    PROC SQL;
        CREATE VIEW WORK.SORTTempTableSorted AS
            SELECT *
        FROM WORK.TMP0TempTableAddtnlPredictData
    ;
    QUIT;
    TITLE;
    TITLE1 "Résultats de la régression logistique";
    FOOTNOTE;
    FOOTNOTE1 "Généré par le Système SAS (&_SASSERVERNAME, &SYSSCPL) le %TRIM(%QSYSFUNC(DATE(), NLDATE20.)) à %TRIM(%SYSFUNC(TIME(), TIMEAMPM12.))";
    PROC LOGISTIC DATA=WORK.SORTTempTableSorted
            PLOTS(ONLY)=NONE
        ;
        CLASS age_classe    (PARAM=EFFECT) Flag_bq_principale   (PARAM=EFFECT) flag_univers_detenus     (PARAM=EFFECT) csp_1    (PARAM=EFFECT) SGMT_FIDELITE    (PARAM=EFFECT) situ_fam_1   (PARAM=EFFECT);
        MODEL TOP_CREDIT_HABITAT_2017 (Event = '1')=top_situ_particuliere top_chgt_csp_6M top_produit_monetaire_bloque top_CREDIT top_chgt_contrat_travail_6M top_credit_CONSO top_credit_HABITAT top_produit_monetaire_dispo top_VM_autres top_Sicav top_produit_epargne_logement top_Predica top_ferm_prod_6M top_ouv_prod_6M top_produit_Assurance top_produit_Cartes top_produit_Credit "moy_surface_financière_6M"n moy_surf_financiere_ecart_6M moy_encours_dav_6M moy_encours_dav_ecart_6M moy_monetaire_dispo_6M moy_monetaire_dispo_ecart_6M moy_emprunts_6M moy_emprunts_ecarts_6M moy_sicav_6M moy_sicav_ecart_6M moy_vm_autres_6M moy_vm_autres_ecart_6M moy_predica_6M moy_predica_ecart_6M moy_bgpi_6M moy_bgpi_ecart_6M moy_epargne_logement_6M moy_epargne_logement_ecart_6M "moy.an_mt_flux_cred_norme_B2"n "moy.an_mt_op_cred_ep_a_terme"n "moy.an_mt_op_debit_ep_a_terme"n "moy.an_mt_ope_credit_depot"n "moy.an_mt_ope_credit_ep_a_vue"n "moy.an_mt_ope_debit_depot"n "moy.an_mt_ope_debit_ep_a_vue"n "moy.an_mt_pmts_carte_etr"n "moy.an_mt_remise_chq"n "moy.an_mt_paie_carte"n "moy.an_mt_paie_chq"n "moy.an_nb_paie_carte"n "moy.an_nb_paie_chq"n "moy.an_mt_ret_carte_Aut_bq"n "moy.an_mt_ret_carte_CRCA"n "moy.an_mt_ret_carte_etr"n "moy.an_nb_flux_cred_normeB2"n "moy.an_nb_ope_credit_ep_a_terme"n "moy.an_nb_ope_debit_ep_a_terme"n "moy.an_nb_ope_credit_depot"n "moy.an_nb_ope_credit_ep_a_vue"n "moy.an_nb_ope_debit_depot"n "moy.an_nb_ope_debit_ep_a_vue"n "moy.an_nb_pmts_carte_etr"n "moy.an_nb_remise_chq"n "moy.an_nb_ret_carte_Aut_bq"n "moy.an_nb_ret_carte_CRCA"n "moy.an_nb_ret_carte_etr"n "moy.an_nb_ret_carte"n "moy.an_mt_factu_ttc"n "moy.an_mt_reduc_ttc"n "moy.an_mt_rist_ttc"n "moy.an_mt_mvt_domicilie_mktg"n "moy.an_nb_mvt_M_domicilie_mktg"n top_produit_Epargne top_ouverture_reclam age_classe Flag_bq_principale flag_univers_detenus csp_1 SGMT_FIDELITE situ_fam_1     /
            SELECTION=STEPWISE
            SLE=0.05
            SLS=0.05
            INCLUDE=0
            LINK=LOGIT
            OUTROC=_PROB_
            ALPHA=95
            EXPEST
            PARMLABEL
            CORRB
            NOPRINT
        ;
    
        OUTPUT OUT=WORK.PREDLogRegPredictions(LABEL="Statistiques et prédictions de régression logistique pour WORK.APPEND_TABLE" WHERE=(NOT __FLAG))
            PREDPROBS=INDIVIDUAL;
    RUN;
    QUIT;
    %end;
    %mend;
    
    DATA WORK.PREDLogRegPredictions; 
        set WORK.PREDLogRegPredictions; 
        TOP_CREDIT_HABITAT_2017=__DEP; 
        _FROM_=__DEP;
        DROP __DEP; 
        DROP __FLAG;
    RUN ;
    QUIT ;
    

Thanks in advance

Recommended answer

If you're trying to do a bootstrap algorithm or something similar, the seminal paper on the topic is David Cassell's Don't Be LOOPy from the 2007 SGF. In broad strokes, it describes the "old" way to do this (involving a loop, where you draw a new sample and then perform the analysis, 50 times over) and the new way, where you use PROC SURVEYSELECT with the rep option.
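For context, a minimal sketch of that "old", loop-based pattern is below. It reuses the TOP_1/TOP_0 table names and the target variable from the question, with two of the question's predictors standing in for the full list; the sampling and model options are illustrative assumptions, not code from the paper or the answer.

    %macro run_reg_loop(n_reps);
        %do i = 1 %to &n_reps;

            /* Draw 3000 rows from each outcome table */
            proc surveyselect data=TOP_1 out=ALEA_1 method=srs n=3000;
            run;
            proc surveyselect data=TOP_0 out=ALEA_0 method=srs n=3000;
            run;

            /* Stack the two samples */
            data both_samples;
                set ALEA_1 ALEA_0;
            run;

            /* Fit one regression per iteration and keep its estimates */
            ods output ParameterEstimates=pe_&i;
            proc logistic data=both_samples;
                model TOP_CREDIT_HABITAT_2017(event='1') = top_CREDIT top_credit_CONSO;
            run;

        %end;
    %mend run_reg_loop;

    %run_reg_loop(50);

As the paper argues, though, resampling and refitting inside a macro loop like this is the slow route; the rep-based approach does all of the sampling in a single pass.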

From the paper, an example:

    proc surveyselect data=YourData out=outboot
     seed=30459584
     method=urs samprate=1 outhits
     rep=1000;
     run;
    

This generates a dataset with a Replicate variable, which you can then use as a BY variable in most analyses. The analysis is then performed separately for each value of that variable, which is presumably what you want. You can use the various options on PROC SURVEYSELECT to get the samples you want (sample size/rate, method of sampling, etc.).
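Applied to this question, one possible sketch is below. It assumes the "1" and "0" clients have already been stacked into a single table (called all_clients here), uses the 0/1 target TOP_CREDIT_HABITAT_2017 as the stratum so that each replicate gets 3000 of each outcome, and again uses two placeholder predictors instead of the full list:

    /* Draw 50 replicates, 3000 observations per outcome group */
    proc surveyselect data=all_clients out=rep_samples
        method=srs n=3000 reps=50 seed=12345;
        strata TOP_CREDIT_HABITAT_2017;
    run;

    /* Make the replicates contiguous for BY-group processing */
    proc sort data=rep_samples;
        by Replicate;
    run;

    /* One logistic regression per replicate; ODS OUTPUT collects the
       maximum likelihood estimates and the association statistics */
    ods output ParameterEstimates=all_estimates Association=all_assoc;
    proc logistic data=rep_samples;
        by Replicate;
        model TOP_CREDIT_HABITAT_2017(event='1') = top_CREDIT top_credit_CONSO
            / selection=stepwise sle=0.05 sls=0.05;
    run;

The ROC value requested in the question appears as the c statistic in the Association table, and the maximum likelihood estimates land in ParameterEstimates, one set per replicate.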

If you're just trying to split your dataset up into chunks so you can either do a smaller analysis (as perhaps it might take too long to run the big one) or create test and validation subsamples, but don't care about how nicely random things are, you can just add a variable in a data step like so:

    data for_regression;
      set your_data;
      sample_group = mod(_n_,50);
    run;
    
    proc sort data=for_regression;
      by sample_group;
    run;
    

And then you have 50 groups; you can sort by something random first if you prefer them to be more "randomized" and don't think they are now, but PROC SURVEYSELECT is usually better for that sort of thing ultimately.
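If you do want that quick shuffle first, one rough sketch (the dataset names are placeholders, and the random key is dropped afterwards) is:

    /* Give each row a random key, shuffle on it, then bucket into 50 groups */
    data shuffled;
        set your_data;
        if _n_ = 1 then call streaminit(12345);   /* reproducible shuffle */
        _rand_key = rand('uniform');
    run;

    proc sort data=shuffled;
        by _rand_key;
    run;

    data for_regression;
        set shuffled;
        sample_group = mod(_n_, 50);   /* groups 0 through 49 */
        drop _rand_key;
    run;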

