在 SAS 中进行合并(使用 IN=) [英] Working of Merge in SAS (with IN=)

查看:201
本文介绍了在 SAS 中进行合并(使用 IN=)的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有两个 dataset data1 和 data2

I have two dataset data1 and data2

data data1; 
input sn id $;
datalines;
1 a
2 a
3 a
;
run;

data data2; 
input id $ sales x $;
datalines;
a 10 x
a 20 y 
a 30 z
a 40 q
;
run;

我正在从以下代码合并它们:

I am merging them from below code:

data join;
merge data1(in=a) data2(in=b);
by id;
if a and b;
run;

结果:(我期待的是内部连接结果,但事实并非如此)

Result: (I was expecting an Inner Join result which is not the case)

1   a   10  x
2   a   20  y
2   a   30  z
2   a   40  w

proc sql 内连接的结果.

proc sql;
select data1.id,sn,sales,x from data2 inner join data1 on data1.hh_id;
quit;

结果:(正如预期的内连接)

Result: (As expected from an inner join)

a  1  10  x
a  1  20  y
a  1  30  z
a  1  40  w
a  2  10  x
a  2  20  y
a  2  30  z
a  2  40  w
b  3  10  x
b  3  20  y
b  3  30  z
b  3  40  w

我想知道 merge 在 SAS 中使用 In=概念STEP BY STEP工作并证明上述结果.

I want to know the concept and STEP BY STEP working of merge statement in SAS with In= and proving the above result.

PS:我已经阅读了这个,它说

PS: I have read this, and it says

这些变量的一个明显用途是控制哪种合并"会发生,使用 if 语句.例如,如果ThisRecordIsFromYourData 和 ThisRecordIsFromOtherData;将使 SAS仅包含与两个输入数据中的 by 变量匹配的行集合(如内连接).

An obvious use for these variables is to control what kind of 'merge' will occur, using if statements. For example, if ThisRecordIsFromYourData and ThisRecordIsFromOtherData; will make SAS only include rows that match on the by variables from both input data sets (like an inner join).

我猜,(如内连接)并非总是如此.

which I guess, (like an Inner Join) is not always the case.

推荐答案

基本上,这是由于 SAS 数据步骤和 SQL 处理各自连接/合并的方式不同造成的.

Basically, this is a result of the difference in how the SAS data step and SQL process their respective join/merges.

SQL 为每个可能的键组合创建一个单独的记录.这是笛卡尔积(在关键级别).

SQL creates a separate record for each possible combination of keys. This is a Cartesian Product (at the key level).

然而,SAS 数据步骤的合并过程非常不同.MERGE 实际上只不过是 SET 的一个特例.它仍然迭代地处理行,一次一个 - 它永远不会返回,并且一次从 PDV 中的任何数据集都不会超过一行.因此,它无法在其正常过程中创建笛卡尔积 - 这将需要随机访问,而 SAS 数据步骤通常不会这样做.

SAS data step, however, process merges very differently. MERGE is really nothing more than a special case of SET. It still processes rows iteratively, one at a time - it never goes back, and never has more than one row from any dataset in the PDV at once. Thus, it cannot create a Cartesian product in its normal process - that would require random access, which the SAS datastep doesn't do normally.

它的作用:

For each unique BY value
  Take the next record from the left side dataset, if one exists with that BY value
  Take the next record from the right side dataset, if one exists with that BY value
  Output a row
Continue until both datasets are exhausted for that BY value

BY 值在任一侧(或两侧)为每个值生成唯一记录,它实际上与 SQL 相同.然而,如果 BY 值在两边产生重复,你就会得到你所拥有的:并排合并,如果一个在另一个之前用完,来自较短数据​​集的最后一行的值(对于那个值)或多或少地被复制下来.(它们实际上是保留的,因此如果您用更改覆盖它们,它们将不会在来自较长数据集的新记录上重置).

With BY values that yield unique records per value on either side (or both), it is effectively identical to SQL. However, with BY values that yield duplicates on BOTH sides, you get what you have there: a side-by-side merge, and if one runs out before the other, the values from the last row of the shorter dataset (for that by value) are more-or-less copied down. (They're actually RETAINED, so if you overwrite them with changes, they will not reset on new records from the longer dataset).

因此,如果 left 有 3 条记录,right 有 4 条记录用于键值 a,就像在您的示例中一样,那么您将获得数据来自以下记录(假设您之后没有更改数据):

So, if left has 3 records and right has 4 records for key value a, like in your example, then you get data from the following records (assuming you don't alter the data after):

left  right
1     1
2     2
3     3
3     4

这篇关于在 SAS 中进行合并(使用 IN=)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆