有效地连接许多sas数据集 [英] Efficiently concatenate many sas datasets

查看:87
本文介绍了有效地连接许多sas数据集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有200,000多个小型数据集,这些数据集具有相同的变量(n <1000,通常n <100),我希望将它们连接到主数据集中。我尝试使用一个宏,该宏使用数据步骤来迭代所有新数据集,并使用 set master new:与主服务器连接,但这花费了很长时间。另外,如果我尝试同时运行,则调用执行数据步骤会说我在一个巨大的服务器盒中内存不足。作为参考,所有小型数据集合计刚好超过5个Gig。任何建议将不胜感激。这是我到目前为止的内容:

I have over 200k small datasets with the same variables (n<1000 and usually n<100) that I want to concatenate into a master dataset. I have tried using a macro that uses a data step to just iterate through all of the new datasets and concatenate with the master with "set master new:", but this is taking a really long time. Also, if I try to run at the same time, the call execute data step says that I am out of memory on a huge server box. For reference, all of the small datasets together are just over 5 Gigs. Any suggestions would be greatly appreciated. Here is what I have so far:

%macro catDat(name, nbr) ;
    /*call in new dataset */
    data new ;
    set libin.&name ;
    run ;

    /* reorder names */
    proc sql noprint;
    create table new as 
    select var1, var2, var3
    from new;
    quit;

    %if &nbr = 1 %then %do ;
        data master;
        set new;
        run;
    %end; 
    %if &nbr > 1 %then %do ;
        data master;
        set master new ;
        run;
    %end ;
%mend;

/* concatenate datasets */
data runthis ;
set datasetNames ;
call execute('%catdat('||datasetname||','||_n_||')');
run;

已解决:请参见下面的Bob注释。

Resolved: see Bob's comments below.

推荐答案

尝试使用 PROC APPEND 代替新数据集;

Try using PROC APPEND instead of your "new" dataset; that will be much, much faster:

%macro DOIT;

proc sql noprint;
   select count(*) into : num_recs
   from datasetNames;
quit;

%do i=1 %to &num_recs;

data _null_;
  i = &i;
  set datasetNames point=i;
  call symput('ds_name',datasetname);
  stop;
run; /* UPDATE:  added this line */

%if &i = 1 %then %do;
/* Initialize MASTER with variables in the order you wish */
data master(keep=var1 var2 var3);
   retain var1 var2 var3;
   set libin.&ds_name;
   stop;
run;
%end;

proc append base=master data=libin.&ds_name(keep=var1 var2 var3);
run;

%end;

%mend DOIT;

PROC APPEND 会将每个数据集添加到您的新的母版,而无需像现在这样每次都对其进行重建。
这也避免了使用 CALL EXECUTE ,消除了您遇到的内存问题(由于在执行堆栈中生成了这么多代码)。

PROC APPEND will add each dataset into your new "master" without rebuilding it each time as you are doing now. This also avoids using CALL EXECUTE, removing that memory issue you were running into (caused by generating so much code into the execution stack).

这篇关于有效地连接许多sas数据集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆