Automated grouping in SAS that minimizes variance within groups
Question
I am trying to build an automated grouping. The goal is to select the grouping that has the lowest variance.
In other words, I want to find natural numbers x and y for the following groups:
GROUP 1: 1977 - x
GROUP 2: x+1 - y
GROUP 3: y+1 - 1994
such that the sum of variance(Response in Group 1), variance(Response in Group 2), and variance(Response in Group 3) is minimized.
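Written out as a formula, and assuming the first group is meant to start at 1977 (the earliest year in the data below), the target could be stated roughly as follows. The symbols G_1, G_2, G_3 and Var are notation introduced here for illustration only:

```latex
\min_{\substack{x,\,y \in \mathbb{N} \\ 1977 \le x < y < 1994}}
\; \sum_{k=1}^{3} \operatorname{Var}\bigl(\text{Response}_i : \text{Year}_i \in G_k\bigr),
\qquad
G_1 = [1977,\, x], \quad
G_2 = [x+1,\, y], \quad
G_3 = [y+1,\, 1994]
```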
data maindat;
input Year Response ;
datalines;
1994 -4.300511714
1994 -9.646920963
1994 -15.86956805
1993 -16.14857235
1993 -13.05797186
1993 -13.80941206
1992 -3.521394503
1992 -1.102526302
1992 -0.137573583
1992 2.669238665
1992 -9.540489193
1992 -19.27474303
1992 -3.527077011
1991 1.676464068
1991 -2.238822314
1991 4.663079037
1991 -5.346920963
1990 -8.543723186
1990 0.507460641
1990 0.995302284
1990 0.464194011
1989 4.728791571
1989 5.578685423
1988 2.771297564
1988 7.109159247
1987 15.96059456
1987 2.985292226
1986 -4.301136971
1985 5.854674875
1985 5.797294021
1984 4.393329025
1983 -6.622580905
1982 0.268500302
1977 12.23062252
;
run;
My idea is to use 2 nested do loops:
1st do loop (1st iteration):      Group 1   1977 - 1977   1977 - 1977   1977 - 1977   …   1977 - 1977
2nd do loop:                      Group 2   1978 - 1978   1978 - 1979   1978 - 1980   …   1978 - 1993
Else:                             Group 3   1979 - 1994   1980 - 1994   1981 - 1994   …   1994 - 1994

1st do loop (2nd iteration):      Group 1   1977 - 1978   1977 - 1978   1977 - 1978   …   1977 - 1978
2nd do loop:                      Group 2   1979 - 1979   1979 - 1980   1979 - 1981   …   1979 - 1993
Else:                             Group 3   1980 - 1994   1981 - 1994   1982 - 1994   …   1994 - 1994

...

1st do loop ((n-1)th iteration):  Group 1   1977 - 1991   1977 - 1991
2nd do loop:                      Group 2   1992 - 1992   1992 - 1993
Else:                             Group 3   1993 - 1994   1994 - 1994

1st do loop (nth iteration):      Group 1   1977 - 1992
2nd do loop:                      Group 2   1993 - 1993
Else:                             Group 3   1994 - 1994
Then I'll select the grouping that gives the smallest sum of the within-group variances of Response across the 3 groups.
Recommended answer
Here is a manual, exhaustive approach. It should solve your problem as stated, but it is not a good way to approach the problem if you want more groups or have larger data.
I'm sure there is a more sensible approach using one of the procs, but nothing springs to mind immediately.
/* Get the year bounds */
proc sql noprint;
    select min(year), max(year)
        into :yMin, :yMax
    from maindat;
quit;

/* Get all the boundaries: each candidate group covers min <= year < max */
data cutoffs;
    do min = &yMin. to &yMax.;
        do max = min + 1 to &yMax. + 1;
            output;
        end;
    end;
run;

proc sql;
    /* Calculate all the variances (missing when a range holds fewer than 2 rows) */
    create table vars as
    select
        a.*,
        var(b.Response) as var
    from cutoffs as a
    left join maindat as b
        on a.min <= b.year < a.max
    group by a.min, a.max;

    /* Get the sum of the variances for each set of 3 groups */
    /* max is exclusive, so the last group must end at &yMax. + 1 to include the final year */
    /* The bare a.var, b.var, c.var conditions drop missing (and zero) variances */
    create table want as
    select
        a.min as a,
        b.min as b,
        c.min as c,
        c.max as d,
        sum(a.var, b.var, c.var) as sumVar
    from vars as a
    left join vars as b
        on a.max = b.min
    left join vars as c
        on b.max = c.min
    where a.min = &yMin. and c.max = &yMax. + 1 and a.var and b.var and c.var
    order by a.min, b.min, c.min;

    /* Output your answer (combine with previous step if you don't want the list) */
    select *
    from want
    where sumVar in (select min(sumVar) from want);
quit;
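Once the best row of want is known, one way to use it is to label each observation with its group. The following is only a sketch: the macro variables cut2 and cut3 and the data set grouped are hypothetical names introduced here, and it assumes want has the columns a, b, c, d and sumVar created above, where b and c are the first years of groups 2 and 3. If several rows tie for the minimum, only the first is used.

```sas
/* Pull the first years of groups 2 and 3 from the best row of WANT */
proc sql noprint;
    select b, c
        into :cut2, :cut3
    from want
    where sumVar in (select min(sumVar) from want);
quit;

/* Assign every observation to one of the 3 groups */
data grouped;
    set maindat;
    if year < &cut2. then group = 1;
    else if year < &cut3. then group = 2;
    else group = 3;
run;

/* Check the per-group counts, means and variances */
proc means data=grouped n mean var;
    class group;
    var Response;
run;
```

The variances reported by proc means for the three groups should add up to the minimum sumVar selected above, which gives a quick sanity check on the cut points.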