有没有一种在 SAS 中转置大表的有效方法 [英] Is there an efficient way of transposing huge table in SAS

查看:16
本文介绍了有没有一种在 SAS 中转置大表的有效方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 SAS 中有一个需要转置的数据集.它有形式id 日期类型值我需要把它转换成id 日期 valueoftype1 valueoftype2 ...

I have a data set in SAS that I need to transpose. It has the form id date type value and I need to convert it into id date valueoftype1 valueoftype2 ...

有没有什么有效的方法可以做到这一点?我的数据量很大.

Is there any efficient way of accomplishing this? My data is huuuge.

例如;

data one; 
input ID date type $ value; 

cards; 
1 2001 A 2
1 2002 A 4
1 2001 B 3
2 2001 B 1
2 2002 A 5
2 2002 C 2
2 2003 C 5
3 2001 B 6
4 2002 B 8
4 2003 B 4
4 2001 A 2
;

我希望将其转换为以下形式;(最后三列是 valA、valB、valC)

I wish to convert it in to following form; (last three columns are valA, valB, valC)

1 2001 2 3 .
1 2002 4 . .
2 2001 . 1 .
2 2002 5 . 2
2 2003 . . 5
3 2001 . 6 .
4 2001 2 . .
4 2002 . 8 .
4 2003 . 4 .

推荐答案

PROC TRANSPOSE 将非常、非常有效地做到这一点,我敢说等于或优于任何其他 DBMS 中最有效的方法.您的数据也已经针对该方法进行了精美的组织.您只需要按 ID DATE 进行排序,除非您已经拥有该组合的索引(如果您有数十亿条记录,IMO 就必须这样做).没有其他解决方案可以接近,除非你有足够的内存来把它全部放在内存中——这对于那个大小的数据集来说是相当疯狂的(即使 10 亿条记录也至少需要 7GB,如果你有数百万个 ID,那么显然不是 1 字节的 ID;我猜是 25-30 GB 或更多.)

PROC TRANSPOSE will do this very, very efficiently, I'd venture to say equal to or better than the most efficient method of any other DBMS out there. Your data is already beautifully organized for that method, also. You just need a sort by ID DATE, unless you already have an index for that combination (which if you have billions of records is a necessity IMO). No other solution will come close, unless you have enough memory to put it all in memory - which would be rather insane for that size dataset (even 1 billion records would be a minimum of 7GB, and if you have millions of IDs then it's clearly not a 1 byte ID; i'd guess 25-30 GB or more.)

proc sort data=one;
by id date;
run;
proc transpose data=one out=want;
by id date;
id type;
var value;
run;

对我的系统进行简单的测试,如下:

A naive test on my system, with the following:

data one; 
do id = 1 to 1e6;
  do date = '01JAN2010'd to '01JAN2012'd;
    type = byte(ceil(ranuni(7)*26)+64);
    value = ceil(ranuni(7)*20);
    output;
  end;
end;
run;
proc sort data=one;
by id date;
run;
proc transpose data=one out=want;
by id date;
id type;
var value;
run;

该数据集已压缩约 20GB(OPTIONS COMPRESS=YES).最初写入大约需要 4 分 15 秒,排序需要 11 分钟,PROC TRANSPOSE 需要 45 分钟,写入一个 ~100GB 的压缩文件.我猜这是你能做的最好的;在这 45 分钟中,超过 20 分钟可能会写出(5 倍大的数据集将花费 5 倍以上的写出时间,加上压缩开销);当时我也在做其他事情,所以 CPU 时间可能有些膨胀,因为它没有得到我的整个处理器(这是我的桌面,4 核 i5).我不认为这是特别不合理的处理时间.

That dataset is ~20GB compressed (OPTIONS COMPRESS=YES). It took about 4 minutes 15 seconds to write initially, took 11 minutes to sort, and took 45 minutes to PROC TRANSPOSE, writing a ~100GB compressed file. I'd guess that's the best you can do; of those 45 minutes, over 20 were likely writing out (5x bigger dataset will take over 5x the time to write out, plus compression overhead); I was also doing other things at the time, so the CPU time was probably inflated some as it didn't get my entire processor (this is my desktop, a 4 core i5). I don't think this is particularly unreasonable processing time at all.

您可能会考虑查看您的需求,而转置可能并不是您真正想要的 - 您真的想要将您的表格扩大到那么大吗?您很可能无需转置整个数据集即可实现您的实际目标(您的分析等).

You might consider looking at your needs, and perhaps a transpose isn't really what you want - do you really want to grow your table that much? Odds are you can achieve your actual goal (your analysis/etc.) without transposing the entire dataset.

这篇关于有没有一种在 SAS 中转置大表的有效方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆