Find matches by condition between 2 datasets in SAS
Question
I'm trying to reduce the processing time of an existing for-loop in a *.jsl file that my classmates and I use in our SAS programming course. My question: does SAS offer a PROC or sequence of statements that can replicate a search-and-match condition? Or is there a way to go through unsorted files without checking line by line for the matching condition(s)?
Our current script file is below:
if( roadNumber_Fuel[n]==roadNumber_TO[m] &
fuelDate[n]>=tripStart[m] & fuelDate[n]<=TripEnd[m],
newtripID[n] = tripID[m];
);
I have 2 sets of data, simplified below.
DATA1:
ID1 Date1
1 May 1, 2012
2 Jun 4, 2013
3 Aug 5, 2013
..
.
&
DATA2:
ID2 Date2 Date3 TRIP_ID
1 Jan 1 2012 Feb 1 2012 9876
2 Sep 5 2013 Nov 3 2013 931
1 Dec 1 2012 Dec 3 2012 236
3 Mar 9 2013 May 3 2013 390
2 Jun 1 2013 Jun 9 2013 811
1 Apr 1 2012 May 5 2012 76
...
..
.
I need to check a lot of iterations, but my goal is to have the code check:
Data1.ID1 = Data2.ID2 AND (Date1 >Date2 and Date1 < Date3)
My desired output dataset is:
ID1 Date1 TRIP_ID
1 May 1, 2012 76
2 Jun 4, 2013 811
Thanks for any insight!
Recommended answer
You can do range matches in two ways. First, if you're familiar with SQL, you can match using PROC SQL:
proc sql;
  create table C as
  select *
  from A
  left join B
    on A.id=B.id and A.date > B.date1 and A.date < B.date2
  ;
quit;
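As a worked sketch of the PROC SQL approach, the step below rebuilds the sample data from the question and runs the range join. The dataset names `data1`, `data2`, and `matched` and the DATE9. input layout are assumptions made for illustration; an inner join is used here instead of the left join above so that only matched rows are kept, as in the desired output.

```sas
/* Sample fuel records from the question (dates entered in DATE9. form) */
data data1;
  input id1 date1 :date9.;
  format date1 date9.;
  datalines;
1 01MAY2012
2 04JUN2013
3 05AUG2013
;
run;

/* Sample trip ranges from the question */
data data2;
  input id2 date2 :date9. date3 :date9. trip_id;
  format date2 date3 date9.;
  datalines;
1 01JAN2012 01FEB2012 9876
2 05SEP2013 03NOV2013 931
1 01DEC2012 03DEC2012 236
3 09MAR2013 03MAY2013 390
2 01JUN2013 09JUN2013 811
1 01APR2012 05MAY2012 76
;
run;

/* Keep only rows where the IDs agree and date1 falls strictly inside the trip */
proc sql;
  create table matched as
  select a.id1, a.date1, b.trip_id
  from data1 as a
  inner join data2 as b
    on a.id1 = b.id2
   and a.date1 > b.date2
   and a.date1 < b.date3
  order by a.id1;
quit;
```

With this sample data, ID 1 falls in the Apr 1 - May 5, 2012 trip (76) and ID 2 in the Jun 1 - Jun 9, 2013 trip (811), while ID 3 has no enclosing range and is dropped by the inner join.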
Second, you can create a format. This is usually the faster option when it's possible. It's tricky when you have IDs, but you can do it.
First, create a new variable that combines ID and date. SAS stores dates as day counts, and dates like these are numbers around 18,000-20,000, so multiplying your ID by 100,000 and adding the date is safe: each ID's dates land in their own numeric range.
Second, create a dataset from the range dataset where START=lower date + id*100,000, END=higher date + id*100,000, and FMTNAME=some string that will become the format name (it must start with A-Z or _, contain only A-Z, _, and digits, and not end in a digit). LABEL is the value you want to retrieve (Trip_ID in the above example).
data b_fmts;
set b;
start=id*100000+date1;
end =id*100000+date2;
label=value_you_want_out;
fmtname='MYDATEF';
run;
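To make the composite key concrete, here is a small sketch that writes the key for one record from the question (ID 1, May 1, 2012) to the log. The variable names are illustrative, and the approach assumes integer IDs.

```sas
/* Illustrative only: show how one ID/date pair maps to a single number */
data _null_;
  id   = 1;
  date = '01MAY2012'd;        /* stored as 19114, days since 01JAN1960 */
  key  = id*100000 + date;    /* 1*100000 + 19114 = 119114 */
  put date= key=;
run;
```

Because every date in this era is below 100,000, the leading digits of the key identify the ID and the last five digits carry the date.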
Then use PROC FORMAT with the CNTLIN= option to import the format.
proc format cntlin=b_fmts;
run;
Make sure your date ranges don't overlap within an ID; if they do, this will fail.
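One way to check for overlaps before building the format is sketched below. It assumes the range dataset is `b` with variables `id`, `date1`, and `date2` as in the step above; `b_sorted`, `overlaps`, and `max_end` are hypothetical names. Ranges that merely touch at an endpoint are flagged too, since format ranges include both endpoints by default.

```sas
/* Order the ranges so each ID's ranges are read by start date */
proc sort data=b out=b_sorted;
  by id date1;
run;

/* Flag any range that starts on or before the latest end seen so far for its ID */
data overlaps;
  set b_sorted;
  by id;
  retain max_end;
  if first.id then max_end = .;
  if not first.id and date1 <= max_end then output;
  max_end = max(max_end, date2);   /* MAX ignores the missing initial value */
  drop max_end;
run;
```

If `overlaps` comes out empty, the ranges should be safe to load with CNTLIN=.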
Then you can use it easily:
data a_match;
set a;
trip_id=put(id*100000+date,MYDATEF.);
run;
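Two details are worth keeping in mind with this lookup: PUT returns a character value, and a key that falls in no range comes back as the number itself rather than as a blank. The sketch below is one possible way to handle both, assuming the same `a` and `b` datasets; it adds an "other" row (HLO='O') with a blank label to the control dataset and converts the result back to a number. The `cats` call and the `12.` informat assume the retrieved value is a number of at most 12 digits.

```sas
/* Build the control dataset with an extra catch-all row for unmatched keys */
data b_fmts;
  set b end=last;
  length label $12 hlo $1;
  retain fmtname 'MYDATEF';
  start = id*100000 + date1;
  end   = id*100000 + date2;
  label = cats(value_you_want_out);   /* store the lookup value as text */
  output;
  if last then do;
    hlo   = 'O';                      /* "other": used when no range matches */
    label = ' ';
    output;
  end;
run;

proc format cntlin=b_fmts;
run;

/* Unmatched rows now get a blank label, which INPUT turns into a missing value */
data a_match;
  set a;
  trip_id = input(put(id*100000 + date, MYDATEF.), ?? 12.);
run;
```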