Tracking ID in SAS
Question
I have a SAS question. I have a large dataset containing unique IDs and a bunch of variables for each year in a time series. Some IDs are present throughout the entire time series, some new IDs are added, and some old IDs are removed.
ID Year Var3 Var4
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
As can be seen from the table above, ID 1 is present in all three years (2015-2017), ID 2 is present from 2016 onwards, ID 3 is removed in 2017, and ID 4 is present in 2015, removed in 2016, and then present again in 2017.
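For anyone who wants to run the code further down, here is a minimal sketch that loads the sample table into a dataset named have (the name the later code reads from). Treating all four columns as numeric is an assumption; the real data may differ.

```sas
/* Sample rows copied from the table above.               */
/* Assumption: id, year, var3 and var4 are all numeric.   */
data have;
  input id year var3 var4;
  datalines;
1 2015 500 200
1 2016 600 300
1 2017 800 100
2 2016 200 100
2 2017 100 204
3 2015 560 969
3 2016 456 768
4 2015 543 679
4 2017 765 534
;
run;
```

The rows are already sorted by id and year, which the BY-group steps later in this thread rely on.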
I would like to know which IDs are new and which are removed in any given year, whilst keeping all the data, e.g. a new table with indicators for which IDs are new and which are removed. Furthermore, it would be nice to get a count of how many IDs are added/removed in a given year, along with the sum of their "Var3" and "Var4". Do you have any suggestions on how to do that?
************* UPDATE ******************
Okay, so I tried the following program:
**** Addition to suggested code ****;
options validvarname=any;
proc sql noprint;
create table years as
select distinct year
from have;
create table ids as
select distinct id
from have;
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
After transposing indicators by year (as in the suggested code), this gives me a table that contains only the IDs that are new in 2017:
data new_in_17;
set indicators;
where ('2016'n=0) and ('2017'n=1);
run;
I can now merge this table to add var3 and var4:
data new17;
merge new_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
Now I can find the frequency of new IDs in 2017 and the sum of var3 and var4:
proc means data=new17 noprint;
var var3 var4;
where year in (2017);
output out=sum_var_freq_new sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
This gives me the output I need. However, I would like the equivalent output for the IDs that are "gone" between 2016 and 2017, which can be produced with:
data gone_in_17;
set indicators;
where ('2016'n=1) and ('2017'n=0);
run;
data gone17;
merge gone_in_17(in=x1) have(in=x2);
by id;
if x1=x2;
run;
proc means data=gone17 noprint;
var var3 var4;
where year in (2016);
output out=sum_var_freq_gone sum(var3)=sum_var3 sum(var4)=sum_var4;
run;
The end result should combine the two tables "sum_var_freq_new" and "sum_var_freq_gone" into one table. Furthermore, I need this table for every year, so my current approach is very inefficient. Do you have any suggestions on how to achieve this efficiently?
Recommended answer
Here is a more efficient way to do this that also gives you the summary values.
First, a little SQL magic. Create the cross product of years and IDs, then join that to your original table to create an indicator:
proc sql noprint;
/*All Years*/
create table years as
select distinct year
from have;
/*All IDS*/
create table ids as
select distinct id
from have;
/*All combinations of ID/year*/
create table all_id_years as
select a.id, b.year
from ids as a,
years as b
order by id, year;
/*Original data with rows added for missing years. Indicator=1 if the*/
/*ID/year combination exists in HAVE, 0 if not.*/
create table indicators as
select coalesce(a.id,b.id) as id,
coalesce(a.year,b.year) as year,
coalesce(a.id/a.id,0) as indicator
from have as a
full join
all_id_years as b
on a.id = b.id
and a.year = b.year
order by id, year
;
quit;
Now transpose that:
proc transpose data=indicators out=indicators(drop=_name_);
by id;
id year;
var indicator;
run;
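With the default VALIDVARNAME=V7 setting, numeric ID values such as 2015 become column names like _2015, which is why the question sets validvarname=any and writes '2016'n. As an alternative to the step above (not in addition to it), here is a hedged sketch that uses the PREFIX= option to get plain names such as y2015. The prefix y and the output name indicators_wide are illustrative choices, not part of the original answer.

```sas
/* Variant of the transpose above: run it INSTEAD of that step,      */
/* while INDICATORS is still in long form (one row per id and year). */
/* PREFIX=y yields columns y2015, y2016, ... (y is a hypothetical    */
/* choice), so no name literals such as '2016'n are needed.          */
proc transpose data=indicators
               out=indicators_wide(drop=_name_)
               prefix=y;
  by id;
  id year;
  var indicator;
run;
```

Later steps would then refer to y2016 and y2017 rather than '2016'n and '2017'n.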
Create the sums. You could also add other summary statistics here if you wanted:
proc summary data=have;
by id;
var var3 var4;
output out=summary sum=;
run;
Merge the indicators and summary values:
data want;
merge indicators summary(keep=id var3 var4);
by id;
run;
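The update in the question asks for the new/removed counts and sums for every year at once. One possible way to get there, sketched under assumptions rather than taken from the answer above, is to keep the data in long form and compare each ID's presence with the previous year using LAG. The dataset names grid, flags, and want_by_year and the flag names are hypothetical. The sketch assumes have holds at most one row per ID and year, and that "previous year" means the previous year value that occurs anywhere in have.

```sas
proc sql;
  /* Full id x year grid with a presence flag and the values     */
  /* (var3/var4 are missing where the id is absent in that year) */
  create table grid as
  select g.id, g.year,
         (h.id is not null) as present,
         h.var3, h.var4
  from (select a.id, b.year
        from (select distinct id   from have) as a,
             (select distinct year from have) as b) as g
       left join have as h
         on g.id = h.id and g.year = h.year
  order by 1, 2;
quit;

data flags;
  set grid;
  by id;
  /* LAG is called on every row so its queue stays in step */
  prev_present = lag(present);
  prev_var3    = lag(var3);
  prev_var4    = lag(var4);
  /* The first year of each id has no previous year to compare with */
  if first.id then call missing(prev_present, prev_var3, prev_var4);
  is_new     = (present = 1 and prev_present = 0);
  is_removed = (present = 0 and prev_present = 1);
run;

proc sql;
  /* One row per year: counts plus sums. For removed ids the sums */
  /* use the previous year's values, as in the question's gone17. */
  create table want_by_year as
  select year,
         sum(is_new)     as n_new,
         sum(case when is_new = 1     then var3      else . end) as new_sum_var3,
         sum(case when is_new = 1     then var4      else . end) as new_sum_var4,
         sum(is_removed) as n_removed,
         sum(case when is_removed = 1 then prev_var3 else . end) as removed_sum_var3,
         sum(case when is_removed = 1 then prev_var4 else . end) as removed_sum_var4
  from flags
  where prev_present is not missing
  group by year;
quit;
```

The earliest year is left out of want_by_year because nothing precedes it. With the sample data, this should flag ID 2 as new and ID 4 as removed in 2016, and ID 4 as new and ID 3 as removed in 2017, matching the description in the question.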