SAS 中的合并工作(使用 IN=) [英] Working of Merge in SAS (with IN=)
问题描述
我有两个 dataset
data1 和 data2
I have two dataset
data1 and data2
data data1;
input sn id $;
datalines;
1 a
2 a
3 a
;
run;
data data2;
input id $ sales x $;
datalines;
a 10 x
a 20 y
a 30 z
a 40 q
;
run;
我正在从下面的代码中合并它们:
I am merging them from below code:
data join;
merge data1(in=a) data2(in=b);
by id;
if a and b;
run;
结果:(我期待的是 Inner Join 结果,但事实并非如此)
Result: (I was expecting an Inner Join result which is not the case)
1 a 10 x
2 a 20 y
2 a 30 z
2 a 40 w
proc sql
内部连接的结果.
proc sql;
select data1.id,sn,sales,x from data2 inner join data1 on data1.hh_id;
quit;
结果:(正如内部连接所预期的那样)
Result: (As expected from an inner join)
a 1 10 x
a 1 20 y
a 1 30 z
a 1 40 w
a 2 10 x
a 2 20 y
a 2 30 z
a 2 40 w
b 3 10 x
b 3 20 y
b 3 30 z
b 3 40 w
我想知道merge
语句在SAS中In=
的概念和STEP BY STEP的工作原理并证明上述结果.
I want to know the concept and STEP BY STEP working of merge
statement in SAS with In=
and proving the above result.
PS:我已阅读 this,上面写着
PS: I have read this, and it says
这些变量的一个明显用途是控制什么样的合并"将发生,使用 if 语句.例如,如果ThisRecordIsFromYourData 和 ThisRecordIsFromOtherData;将使SAS仅包含与两个输入数据中的 by 变量匹配的行集合(如内部连接).
An obvious use for these variables is to control what kind of 'merge' will occur, using if statements. For example, if ThisRecordIsFromYourData and ThisRecordIsFromOtherData; will make SAS only include rows that match on the by variables from both input data sets (like an inner join).
我猜,(如内部连接)并非总是如此.
which I guess, (like an Inner Join) is not always the case.
推荐答案
基本上,这是由于 SAS 数据步骤和 SQL 处理各自连接/合并的方式不同造成的.
Basically, this is a result of the difference in how the SAS data step and SQL process their respective join/merges.
SQL 为每个可能的键组合创建一个单独的记录.这是笛卡尔积(在关键级别).
SQL creates a separate record for each possible combination of keys. This is a Cartesian Product (at the key level).
但是,SAS 数据步骤的过程合并非常不同.MERGE
实际上只不过是 SET
的一个特例.它仍然迭代地处理行,一次一个 - 它永远不会返回,并且从 PDV 中的任何数据集中一次永远不会超过一行.因此,它无法在其正常过程中创建笛卡尔积 - 这需要随机访问,而 SAS 数据步通常不会这样做.
SAS data step, however, process merges very differently. MERGE
is really nothing more than a special case of SET
. It still processes rows iteratively, one at a time - it never goes back, and never has more than one row from any dataset in the PDV at once. Thus, it cannot create a Cartesian product in its normal process - that would require random access, which the SAS datastep doesn't do normally.
它的作用:
For each unique BY value
Take the next record from the left side dataset, if one exists with that BY value
Take the next record from the right side dataset, if one exists with that BY value
Output a row
Continue until both datasets are exhausted for that BY value
使用 BY 值在任一侧(或两侧)产生每个值的唯一记录,它实际上与 SQL 相同.但是,使用 BY 值会在两侧产生重复项,您将得到您所拥有的:并排合并,如果一个在另一个之前用完,则来自较短数据集的最后一行的值(为此值)或多或少地被复制下来.(它们实际上是保留的,因此如果您用更改覆盖它们,它们将不会在较长数据集中的新记录上重置).
With BY values that yield unique records per value on either side (or both), it is effectively identical to SQL. However, with BY values that yield duplicates on BOTH sides, you get what you have there: a side-by-side merge, and if one runs out before the other, the values from the last row of the shorter dataset (for that by value) are more-or-less copied down. (They're actually RETAINED, so if you overwrite them with changes, they will not reset on new records from the longer dataset).
所以,如果 left
有 3 条记录,而 right
有 4 条记录键值 a
,就像在您的示例中一样,那么您将获得数据来自以下记录(假设您之后不更改数据):
So, if left
has 3 records and right
has 4 records for key value a
, like in your example, then you get data from the following records (assuming you don't alter the data after):
left right
1 1
2 2
3 3
3 4
这篇关于SAS 中的合并工作(使用 IN=)的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!