有没有一种在SAS中转置大表的有效方法 [英] Is there an efficient way of transposing huge table in SAS

查看:40
本文介绍了有没有一种在SAS中转置大表的有效方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在 SAS 中有一个数据集需要转置.它有形式id 日期类型值我需要把它转换成id 日期 valueoftype1 valueoftype2 ...

I have a data set in SAS that I need to transpose. It has the form id date type value and I need to convert it into id date valueoftype1 valueoftype2 ...

有没有什么有效的方法可以做到这一点?我的数据很大.

Is there any efficient way of accomplishing this? My data is huuuge.

例如;

data one; 
input ID date type $ value; 

cards; 
1 2001 A 2
1 2002 A 4
1 2001 B 3
2 2001 B 1
2 2002 A 5
2 2002 C 2
2 2003 C 5
3 2001 B 6
4 2002 B 8
4 2003 B 4
4 2001 A 2
;

我想把它转换成下面的形式;(最后三列是valA, valB, valC)

I wish to convert it in to following form; (last three columns are valA, valB, valC)

1 2001 2 3 .
1 2002 4 . .
2 2001 . 1 .
2 2002 5 . 2
2 2003 . . 5
3 2001 . 6 .
4 2001 2 . .
4 2002 . 8 .
4 2003 . 4 .

推荐答案

PROC TRANSPOSE 将非常、非常有效地做到这一点,我敢说它等于或优于目前任何其他 DBMS 中最有效的方法.您的数据也已经针对该方法进行了精美的组织.您只需要按 ID DATE 排序,除非您已经有了该组合的索引(如果您有数十亿条记录,IMO 就必须这样做).没有其他解决方案会接近,除非您有足够的内存将其全部放入内存中 - 这对于该大小的数据集来说是相当疯狂的(即使 10 亿条记录也至少需要 7GB,如果您有数百万个 ID,那么这显然是不是 1 个字节的 ID;我猜是 25-30 GB 或更多.)

PROC TRANSPOSE will do this very, very efficiently, I'd venture to say equal to or better than the most efficient method of any other DBMS out there. Your data is already beautifully organized for that method, also. You just need a sort by ID DATE, unless you already have an index for that combination (which if you have billions of records is a necessity IMO). No other solution will come close, unless you have enough memory to put it all in memory - which would be rather insane for that size dataset (even 1 billion records would be a minimum of 7GB, and if you have millions of IDs then it's clearly not a 1 byte ID; i'd guess 25-30 GB or more.)

proc sort data=one;
by id date;
run;
proc transpose data=one out=want;
by id date;
id type;
var value;
run;

对我的系统进行的简单测试,内容如下:

A naive test on my system, with the following:

data one; 
do id = 1 to 1e6;
  do date = '01JAN2010'd to '01JAN2012'd;
    type = byte(ceil(ranuni(7)*26)+64);
    value = ceil(ranuni(7)*20);
    output;
  end;
end;
run;
proc sort data=one;
by id date;
run;
proc transpose data=one out=want;
by id date;
id type;
var value;
run;

该数据集已压缩约 20GB(OPTIONS COMPRESS=YES).最初写入大约需要 4 分 15 秒,排序需要 11 分钟,PROC TRANSPOSE 需要 45 分钟,写入一个 ~100GB 的压缩文件.我想这是你能做的最好的事情;在这 45 分钟中,有超过 20 分钟可能被写出(5 倍大的数据集将花费超过 5 倍的时间来写出,加上压缩开销);当时我也在做其他事情,所以 CPU 时间可能被夸大了一些,因为它没有得到我的整个处理器(这是我的台式机,一个 4 核 i5).我认为这根本不是特别不合理的处理时间.

That dataset is ~20GB compressed (OPTIONS COMPRESS=YES). It took about 4 minutes 15 seconds to write initially, took 11 minutes to sort, and took 45 minutes to PROC TRANSPOSE, writing a ~100GB compressed file. I'd guess that's the best you can do; of those 45 minutes, over 20 were likely writing out (5x bigger dataset will take over 5x the time to write out, plus compression overhead); I was also doing other things at the time, so the CPU time was probably inflated some as it didn't get my entire processor (this is my desktop, a 4 core i5). I don't think this is particularly unreasonable processing time at all.

您可能会考虑查看您的需求,也许转置并不是您真正想要的 - 您真的想把桌子扩大这么多吗?您可能无需转置整个数据集即可实现您的实际目标(您的分析等).

You might consider looking at your needs, and perhaps a transpose isn't really what you want - do you really want to grow your table that much? Odds are you can achieve your actual goal (your analysis/etc.) without transposing the entire dataset.

这篇关于有没有一种在SAS中转置大表的有效方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆