Which statistics are calculated faster in SAS PROC SUMMARY?


Question


I need a theoretical answer.

Imagine that you have a table with 1.5 billion rows (the table is created as a column-organized table with DB2 BLU).

You are using SAS, and you will compute some statistics with PROC SUMMARY, such as min/max/mean, standard deviation, and the 10th and 90th percentiles, across your peer groups.

For instance, you have 30,000 peer groups with 50,000 values in each (1.5 billion values in total).

In the other case, you have 3 million peer groups with 50 values in each, so again 1.5 billion values in total.

Would it go faster with fewer peer groups but more values in each, or with more peer groups but fewer values in each?

I could test the first case (30,000 peer groups with 50,000 values per group), and it took around 16 minutes. But I can't test the second case.

Can you give an approximate run-time estimate for the case where I have 3 million peer groups with 50 values in each?

One more dimension to the question: would it be faster to compute those statistics with PROC SQL instead?

Example code is below:

proc summary data=table_blu missing chartype;
   class var1 var2; /* var1 and var2 together form the peer group */
   var values;
   output out=stattable(rename=(_type_=type) drop=_freq_)
      n=n min=min max=max mean=mean std=std q1=q1 q3=q3 p10=p10 p90=p90 p95=p95;
run;

Solution

So there are a number of things to think about here.

The first point, and quite possibly the largest in terms of performance, is getting the data from DB2 into SAS. (I'm assuming this is not an in-database instance of SAS -- correct me if it is.) That's a big table, and moving it across the wire takes time. Because of that, if you can calculate all these statistics inside DB2 with an SQL statement, that will probably be your fastest option.
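As a sketch of that in-database approach, an explicit pass-through query along these lines would push the aggregation into DB2 so that only the summary rows cross the wire. The connection options are placeholders, and the exact percentile syntax (PERCENTILE_CONT on recent Db2 versions) should be verified for your database; note also that a column literally named VALUES may need double-quoting in DB2, since VALUES is a reserved word there.

```sas
/* Sketch only: aggregate inside DB2 via explicit pass-through.
   Connection options and percentile syntax are assumptions to verify. */
proc sql;
   connect to db2 (database=mydb user=myuser password=mypw);
   create table stattable as
   select * from connection to db2 (
      select var1, var2,
             count(*)       as n,
             min("VALUES")  as min,
             max("VALUES")  as max,
             avg("VALUES")  as mean,
             stddev("VALUES") as std
      from table_blu
      group by var1, var2
   );
   disconnect from db2;
quit;
```

With 30,000 to 3 million groups, the result set shipped back to SAS is tiny compared to the 1.5-billion-row source table.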

So assuming you've downloaded the table to the SAS server:

A table sorted by the CLASS variables will be MUCH faster to process than an unsorted table. If SAS knows the table is sorted, it doesn't have to scan the table for records to go into a group; it can do block reads instead of random IO.
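For example, sorting once and then summarizing might look like the sketch below. Using BY instead of CLASS on the sorted table lets SAS process one group at a time rather than holding all groups in memory; note that BY output has no _type_ variable, so the rename from the question's code is dropped here.

```sas
/* Sort once, then let PROC SUMMARY exploit the order.
   BY processing handles one peer group at a time. */
proc sort data=table_blu out=table_sorted;
   by var1 var2;
run;

proc summary data=table_sorted missing;
   by var1 var2;
   var values;
   output out=stattable(drop=_freq_)
      n=n min=min max=max mean=mean std=std p10=p10 p90=p90 p95=p95;
run;
```

The sort itself is not free on 1.5 billion rows, but it pays off if the table is summarized more than once.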

If the table is not sorted, then the larger the number of groups, the more table scans have to occur.

The point is, the speed of getting data from the HD to the CPU will be paramount in an unsorted process.

From there, you get into memory and CPU issues. PROC SUMMARY is multithreaded, and SAS will read N groups at a time. If a group fits into the memory allocated for its thread, you won't have an issue. If the group is too large, then SAS will have to page.
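If paging is suspected, the relevant knobs are the SUMSIZE system option (the memory ceiling for summarization procedures) and the THREADS option on the procedure itself; the 4G value below is a placeholder to tune for your server, not a recommendation.

```sas
/* Raise the summarization memory ceiling and request threading.
   4G is a placeholder value; tune it to your server's RAM. */
options sumsize=4G;

proc summary data=table_blu missing chartype threads;
   class var1 var2;
   var values;
   output out=stattable(rename=(_type_=type) drop=_freq_)
      n=n min=min max=max mean=mean std=std p10=p10 p90=p90 p95=p95;
run;
```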

I scaled down the problem to a 15M row example:

%let grps=3000;
%let pergrp=5000;
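The benchmark table itself can be generated with a data step along these lines (my reconstruction; the original answer does not show it). Groups are assigned at random so the dataset comes out unsorted; a PROC SORT by grp produces the sorted variant.

```sas
/* Reconstruction of the benchmark data: &grps groups of roughly
   &pergrp random values each, written in random group order. */
data test;
   call streaminit(42);
   do i = 1 to &grps * &pergrp;
      grp   = ceil(rand('uniform') * &grps);
      value = rand('normal');
      output;
   end;
   drop i;
run;
```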

UNSORTED:

NOTE: There were 15000000 observations read from the data set
      WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
      variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           20.88 seconds
      cpu time            31.71 seconds

SORTED:

NOTE: There were 15000000 observations read from the data set
      WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
      variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           5.44 seconds
      cpu time            11.26 seconds

=============================

%let grps=300000;
%let pergrp=50;

UNSORTED:

NOTE: There were 15000000 observations read from the data set
      WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
      variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           19.26 seconds
      cpu time            41.35 seconds

SORTED:

NOTE: There were 15000000 observations read from the data set
      WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
      variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           5.43 seconds
      cpu time            10.09 seconds

I ran these a few times and the run times were similar. The sorted times are about equal to each other and far faster than the unsorted ones.

The more-groups/fewer-per-group case was slightly faster unsorted, but look at the total CPU usage: it is higher. My laptop has an extremely fast SSD, so IO was probably not the limiting factor -- the HD was able to keep up with the multi-core CPU's demands. On a system with a slower HD, the total run times could differ.

In the end, it depends too much on how the data is structured and the specifics of your server and DB.
