高效连接多个 sas 数据集 [英] Efficiently concatenate many sas datasets

查看:34
本文介绍了高效连接多个 sas 数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有超过 20 万个具有相同变量(n<1000,通常是 n<100)的小数据集,我想将它们连接到主数据集中.我尝试使用一个宏,它使用一个数据步骤来遍历所有新数据集,并使用set master new:"与主数据集连接,但这需要很长时间.此外,如果我尝试同时运行,调用执行数据步骤会说我在一个巨大的服务器机器上内存不足.作为参考,所有小数据集加起来刚刚超过 5 Gigs.任何建议将不胜感激.这是我到目前为止所拥有的:

I have over 200k small datasets with the same variables (n<1000 and usually n<100) that I want to concatenate into a master dataset. I have tried using a macro that uses a data step to just iterate through all of the new datasets and concatenate with the master with "set master new:", but this is taking a really long time. Also, if I try to run at the same time, the call execute data step says that I am out of memory on a huge server box. For reference, all of the small datasets together are just over 5 Gigs. Any suggestions would be greatly appreciated. Here is what I have so far:

%macro catDat(name, nbr) ;
    /*call in new dataset */
    data new ;
    set libin.&name ;
    run ;

    /* reorder names */
    proc sql noprint;
    create table new as 
    select var1, var2, var3
    from new;
    quit;

    %if &nbr = 1 %then %do ;
        data master;
        set new;
        run;
    %end; 
    %if &nbr > 1 %then %do ;
        data master;
        set master new ;
        run;
    %end ;
%mend;

/* concatenate datasets */
data runthis ;
set datasetNames ;
call execute('%catdat('||datasetname||','||_n_||')');
run;

已解决:请参阅下面 Bob 的评论.

Resolved: see Bob's comments below.

推荐答案

尝试使用 PROC APPEND 而不是你的新"数据集;这会快得多:

Try using PROC APPEND instead of your "new" dataset; that will be much, much faster:

%macro DOIT;

proc sql noprint;
   select count(*) into : num_recs
   from datasetNames;
quit;

%do i=1 %to &num_recs;

data _null_;
  i = &i;
  set datasetNames point=i;
  call symput('ds_name',datasetname);
  stop;
run; /* UPDATE:  added this line */

%if &i = 1 %then %do;
/* Initialize MASTER with variables in the order you wish */
data master(keep=var1 var2 var3);
   retain var1 var2 var3;
   set libin.&ds_name;
   stop;
run;
%end;

proc append base=master data=libin.&ds_name(keep=var1 var2 var3);
run;

%end;

%mend DOIT;

PROC APPEND 会将每个数据集添加到您的新主"中,而无需像您现在所做的那样每次都重新构建它.这也避免了使用 CALL EXECUTE,消除了您遇到的内存问题(由于在执行堆栈中生成了太多代码).

PROC APPEND will add each dataset into your new "master" without rebuilding it each time as you are doing now. This also avoids using CALL EXECUTE, removing that memory issue you were running into (caused by generating so much code into the execution stack).

这篇关于高效连接多个 sas 数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆