Automated grouping in SAS that minimizes within-group variance

Question

I am trying to build an automated grouping. The goal is to select the grouping that gives the lowest variance.

In other words, I want to find natural numbers x and y that define the following groups of years:

GROUP 1: 1977 - x
GROUP 2: x+1 - y
GROUP 3: y+1 - 1994

such that the sum of variance(Response in Group 1), variance(Response in Group 2), and variance(Response in Group 3) is minimized.
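
Written as a formula, the objective might look like the following. The notation here (G_g for the set of observations in group g, r_i for a Response value, n_g for the group size) is introduced only for illustration, and it assumes the sample variance with an n - 1 denominator, which is what the SAS var function computes.

```latex
\min_{x < y} \; \sum_{g=1}^{3} s_g^2(x, y),
\qquad
s_g^2(x, y) = \frac{1}{n_g - 1} \sum_{i \in G_g} \left( r_i - \bar{r}_g \right)^2,
\\[4pt]
G_1 = \{ i : 1977 \le \mathrm{Year}_i \le x \}, \quad
G_2 = \{ i : x + 1 \le \mathrm{Year}_i \le y \}, \quad
G_3 = \{ i : y + 1 \le \mathrm{Year}_i \le 1994 \}
```

Under this definition a group with a single observation has no defined variance, so such splits cannot be scored.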

data maindat;
input  Year Response ;
datalines;
1994    -4.300511714
1994    -9.646920963
1994    -15.86956805
1993    -16.14857235
1993    -13.05797186
1993    -13.80941206
1992    -3.521394503
1992    -1.102526302
1992    -0.137573583
1992    2.669238665
1992    -9.540489193
1992    -19.27474303
1992    -3.527077011
1991    1.676464068
1991    -2.238822314
1991    4.663079037
1991    -5.346920963
1990    -8.543723186
1990    0.507460641
1990    0.995302284
1990    0.464194011
1989    4.728791571
1989    5.578685423
1988    2.771297564
1988    7.109159247
1987    15.96059456
1987    2.985292226
1986    -4.301136971
1985    5.854674875
1985    5.797294021
1984    4.393329025
1983    -6.622580905
1982    0.268500302
1977    12.23062252
;
run;

My idea is to use two nested do loops:

1st do loop (1st iteration): Group 1    1977 - 1977    1977 - 1977   1977 - 1977    …   1977 - 1977
2nd do loop:                 Group 2    1978 - 1978    1978 - 1979   1978 - 1980    …   1978 - 1993
Else:                        Group 3    1979 - 1994    1980 - 1994   1981 - 1994    …   1994 - 1994
1st do loop (2nd iteration): Group 1    1977 - 1978    1977 - 1978   1977 - 1978    …   1977 - 1978
2nd do loop:                 Group 2    1979 - 1979    1979 - 1980   1979 - 1981    …   1979 - 1993
Else                         Group 3    1980 - 1994    1981 - 1994   1982 - 1994    …   1994 - 1994
...
1st do loop (n-1th iteration) Group 1   1977 - 1991   1977 - 1991           
2nd do loop:                  Group 2   1992 - 1992   1992 - 1993           
Else                          Group 3   1993 - 1994   1994 - 1994           
1st do loop (nth iteration)   Group 1   1977 - 1992             
2nd do loop:                  Group 2   1993 - 1993             
Else                          Group 3   1994 - 1994             

Then I would select the grouping that gives the smallest sum of the within-group variances of Response across the 3 groups.

Recommended answer

Here is a manual, exhaustive approach. It should solve your problem as stated, but it is not a good way to approach the problem if you want more groups or have larger data, because the number of candidate groupings grows quickly with both.

I'm sure there is a more sensible approach using one of the procs, but nothing springs to mind immediately.

/* Get the year bounds */
proc sql noprint;
    select min(year), max(year)
    into :yMin, :yMax
    from maindat;
quit;

/* Enumerate every candidate year interval, half-open: min <= year < max */
data cutoffs;
    do min = &yMin. to &yMax.;
        do max = min + 1 to &yMax. + 1;
            output;
        end;
    end;
run;
proc sql;
    /* Calculate all the variances */
    create table vars as
    select 
        a.*,
        var(b.Response) as var
    from cutoffs as a
    left join maindat as b
        on a.min <= b.year < a.max
    group by a.min, a.max;

    /* Get the sum of the variances for each set of 3 groups */
    create table want as
    select 
        a.min as a,
        b.min as b,
        c.min as c,
        c.max as d,
        sum(a.var, b.var, c.var) as sumVar
    from vars as a
    left join vars as b
        on a.max = b.min
    left join vars as c
        on b.max = c.min
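    /* Intervals are half-open (min <= year < max), so the last group must end
       at &yMax. + 1 to include the final year; column d is therefore an
       exclusive upper bound. A missing variance (a group with fewer than two
       observations) evaluates as false, so those groupings are dropped. */
    where a.min = &yMin. and c.max = &yMax. + 1 and a.var and b.var and c.var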
    order by a.min, b.min, c.min;

    /* Output your answer (combine with previous step if you don't want the list) */
    select * 
    from want
    where sumVar in (select min(sumVar) from want);
quit;
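
To cross-check the result outside SAS, the same exhaustive search can be sketched in Python. This is a minimal sketch and not part of the original answer: the names RESPONSES_BY_YEAR, DATA, and best_three_groups are hypothetical, the groups are assumed to cover every year from the minimum to the maximum inclusive, and the sample variance (n - 1 denominator) is assumed so that it matches the SAS var function. Splits where any group has fewer than two observations are skipped, since their variance is undefined.

```python
from itertools import combinations
from statistics import variance

# Response values per year, copied from the maindat data step above.
RESPONSES_BY_YEAR = {
    1994: [-4.300511714, -9.646920963, -15.86956805],
    1993: [-16.14857235, -13.05797186, -13.80941206],
    1992: [-3.521394503, -1.102526302, -0.137573583, 2.669238665,
           -9.540489193, -19.27474303, -3.527077011],
    1991: [1.676464068, -2.238822314, 4.663079037, -5.346920963],
    1990: [-8.543723186, 0.507460641, 0.995302284, 0.464194011],
    1989: [4.728791571, 5.578685423],
    1988: [2.771297564, 7.109159247],
    1987: [15.96059456, 2.985292226],
    1986: [-4.301136971],
    1985: [5.854674875, 5.797294021],
    1984: [4.393329025],
    1983: [-6.622580905],
    1982: [0.268500302],
    1977: [12.23062252],
}

# Flatten to (year, response) rows, like the SAS data set.
DATA = [(year, r) for year, values in RESPONSES_BY_YEAR.items() for r in values]


def best_three_groups(rows):
    """Return (sum_of_variances, x, y) for the groups
    [min..x], [x+1..y], [y+1..max], or None if no split is valid."""
    y_min = min(year for year, _ in rows)
    y_max = max(year for year, _ in rows)
    best = None
    # x is the last year of group 1, y is the last year of group 2 (x < y < y_max).
    for x, y in combinations(range(y_min, y_max), 2):
        groups = [
            [r for yr, r in rows if yr <= x],
            [r for yr, r in rows if x < yr <= y],
            [r for yr, r in rows if yr > y],
        ]
        # Sample variance needs at least two observations in every group.
        if any(len(g) < 2 for g in groups):
            continue
        total = sum(variance(g) for g in groups)
        # Strict comparison keeps the first of any tied splits.
        if best is None or total < best[0]:
            best = (total, x, y)
    return best


if __name__ == "__main__":
    print(best_three_groups(DATA))
```

Because some years have no observations, several (x, y) pairs can describe the same partition of the data and tie on the sum of variances. This sketch reports only the first such pair, whereas the final SAS query lists every tied row.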
