Which statistics are calculated faster in SAS PROC SUMMARY?


Question

I need a theoretical answer.

Imagine that you have a table with 1.5 billion rows (the table is created as column-based with DB2-Blu).

You are using SAS and you will compute some statistics with PROC SUMMARY, such as min/max/mean, standard deviation, and the 10th and 90th percentiles, across your peer groups.

For instance, you have 30,000 peer groups with 50,000 values in each peer group (1.5 billion values in total).

In the other case you have 3 million peer groups with 50 values in each peer group, so again 1.5 billion values in total.

Would it go faster with fewer peer groups but more values in each, or with more peer groups but fewer values in each?

I could test the first case (30,000 peer groups and 50,000 values per peer group) and it took around 16 minutes. But I can't test the second case.

Can you give an approximate prediction of the run time for the case where I have 3 million peer groups and 50 values in each?

One more dimension to the question: would it be faster to do those statistics with PROC SQL instead?

Sample code:

proc summary data = table_blu missing chartype;
   class var1 var2; /* var1 and var2 together form the peer group */
   var values;

   output out = stattable(rename = (_type_ = type) drop = _freq_)
      n=n min=min max=max mean=mean std=std q1=q1 q3=q3 p10=p10 p90=p90 p95=p95;
run;

Answer

So there are a number of things to think about here.

The first point, and quite possibly the largest in terms of performance, is getting the data from DB2 into SAS. (I'm assuming this is not an in-database instance of SAS -- correct me if it is). That's a big table, and moving it across the wire takes time. Because of that, if you can calculate all these statistics inside DB2 with an SQL statement, that will probably be your fastest option.
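A minimal sketch of that idea, using Python's built-in sqlite3 as a stand-in for DB2 (table and column names here are hypothetical): pushing the aggregation into the database means only one summary row per peer group crosses the wire instead of the raw rows. DB2's SQL additionally offers standard-deviation and percentile aggregates for the remaining statistics; SQLite is used below only because it ships with Python.

```python
import sqlite3

# Stand-in for DB2: an in-memory SQLite table with a peer-group key and a value.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table_blu (var1 TEXT, var2 TEXT, val REAL)")
rows = [("a", "x", v) for v in (1.0, 2.0, 3.0)] + [("b", "y", v) for v in (10.0, 20.0)]
con.executemany("INSERT INTO table_blu VALUES (?, ?, ?)", rows)

# Push the aggregation into the database: one result row per peer group.
stats = con.execute("""
    SELECT var1, var2, COUNT(*), MIN(val), MAX(val), AVG(val)
    FROM table_blu
    GROUP BY var1, var2
    ORDER BY var1, var2
""").fetchall()

for row in stats:
    print(row)  # one short summary row per peer group
```

With 30,000 peer groups this returns 30,000 rows rather than 1.5 billion, so the transfer cost all but disappears.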

So assuming you've downloaded the table to the SAS server:

A table sorted by the CLASS variables will be MUCH faster to process than an unsorted table. If SAS knows the table is sorted, it doesn't have to scan the table for records to go into a group; it can do block reads instead of random IO.

If the table is not sorted, then the larger the number of groups, the more table scans that have to occur.

The point is, the speed of getting data from the HD to the CPU will be paramount in an unsorted process.

From there, you get into a memory and CPU issue. PROC SUMMARY is multithreaded and SAS will read N groups at a time. If the group size can fit into the memory allocated for that thread, you won't have an issue. If the group size is too large, then SAS will have to page.

I scaled down the problem to a 15M row example:

%let grps=3000;
%let pergrp=5000;

Unsorted:

NOTE: There were 15000000 observations read from the data set
      WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
      variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           20.88 seconds
      cpu time            31.71 seconds

Sorted:

NOTE: There were 15000000 observations read from the data set
      WORK.TEST.
NOTE: The data set WORK.SUMMARY has 3001 observations and 9
      variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           5.44 seconds
      cpu time            11.26 seconds

==============================

%let grps=300000;
%let pergrp=50;

Unsorted:

NOTE: There were 15000000 observations read from the data set
      WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
      variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           19.26 seconds
      cpu time            41.35 seconds

Sorted:

NOTE: There were 15000000 observations read from the data set
      WORK.TEST.
NOTE: The data set WORK.SUMMARY has 300001 observations and 9
      variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
      real time           5.43 seconds
      cpu time            10.09 seconds

I ran these a few times and the run times were similar. Sorted times are about equal and way faster.

The more-groups / fewer-per-group case was faster unsorted, but look at the total CPU usage: it is higher. My laptop has an extremely fast SSD, so IO was probably not the limiting factor; the HD was able to keep up with the multi-core CPU's demands. On a system with a slower HD, the total run times could be different.

In the end, it depends too much on how the data is structured and the specifics of your server and DB.

