将外键上的 SQL 连接转换为 R data.table 语法 [英] Translating SQL joins on foreign keys to R data.table syntax
问题描述
data.table
包提供许多与 SQL 相同的表处理方法.如果一个表有一个键,则该键由一个或多个列组成.但是一个表不能有多个键,因为它不能同时以两种不同的方式排序.
The data.table
package provides many of the same table handling methods as SQL. If a table has a key, that key consists of one or more columns. But a table can't have more than one key, because it can't be sorted in two different ways at the same time.
在此示例中,X
和 Y
是具有单个键列id"的 data.table
;Y
还有一个非键列x_id".
In this example, X
and Y
are data.table
s with a single key column "id"; Y
also has a non-key column "x_id".
X <- data.table(id = 1:5, a=4:8,key="id")
Y <- data.table(id = c(1,1, 3,5,7), x_id=c(1,4:1), key="id")
以下语法将在其键上连接表:
The following syntax would join the tables on their keys:
X[Y]
如何将以下 SQL 语法转换为 data.table 代码?
How can I translate the following SQL syntax to data.table code?
select * from X join Y on X.id = Y.x_id;
我得到的最接近的是:
Y[X,list(id, x_id),by = x_id,nomatch=0]
但是,这与 SQL 语句执行的内连接不同.
However, this does not do the same inner join as the SQL statement.
这是一个更清晰的例子,其中外键是 y_id,我们希望连接查找 Y2 的值,其中 X2$y_id = Y2$id
.
Here is a more clear example in which the foreign key is y_id, and we want the join to look up values of Y2 where X2$y_id = Y2$id
.
X2 <- data.table(id = 1:5, y_id = c(1,1,2,2,2), key="id")
Y2 <- data.table(id = 1:5, b = letters[1:5], key="id")
我要制作表格:
id y_id b
1 1 "a"
2 1 "a"
3 2 "b"
4 2 "b"
5 2 "b"
类似于以下 kludge 所做的事情:
similar to what is done by the following kludge:
> merge(data.frame(X2), data.frame(Y2), by.x = "y_id", by.y = "id")
y_id id b
1 1 1 a
2 1 2 a
3 2 3 b
4 2 4 b
5 2 5 b
但是,当我这样做时:
X2[Y2, 1:2,by = y_id]
我没有得到想要的结果:
I do not get the desired result:
y_id V1
[1,] 1 1
[2,] 1 2
[3,] 2 1
[4,] 2 2
推荐答案
好问题.请注意 ?data.table
中的以下内容(不可否认):
Good question. Note the following (admittedly buried) in ?data.table
:
当 i
是 data.table
时,x
必须有一个键.i
使用键连接到 x
并返回 x
中匹配的行.i
中的每一列与 x
的键中的每一列之间执行 equi-join.匹配是在 O(log n) 时间内编译的 C 中的二进制搜索.如果 i
的列数少于 x
的键,则 x
的许多行可能与 i
的每一行匹配.如果 i
的列多于 x
的键,则 i
中未参与连接的列将包含在结果中.如果i
也有key,就是i
的key列用来匹配x
's 键列和两个表的二进制合并.
When
i
is adata.table
,x
must have a key.i
is joined tox
using the key and the rows inx
that match are returned. An equi-join is performed between each column ini
to each column inx
's key. The match is a binary search in compiled C in O(log n) time. Ifi
has less columns thanx
's key then many rows ofx
may match to each row ofi
. Ifi
has more columns thanx
's key, the columns ofi
not involved in the join are included in the result. Ifi
also has a key, it isi
's key columns that are used to match tox
's key columns and a binary merge of the two tables is carried out.
所以,这里的关键是 i
不必键入.只有 x
必须键入.
So, the key here is that i
doesn't have to be keyed. Only x
must be keyed.
X2 <- data.table(id = 11:15, y_id = c(14,14,11,12,12), key="id")
id y_id
[1,] 11 14
[2,] 12 14
[3,] 13 11
[4,] 14 12
[5,] 15 12
Y2 <- data.table(id = 11:15, b = letters[1:5], key="id")
id b
[1,] 11 a
[2,] 12 b
[3,] 13 c
[4,] 14 d
[5,] 15 e
Y2[J(X2$y_id)] # binary search for each item of (unsorted and unkeyed) i
id b
[1,] 14 d
[2,] 14 d
[3,] 11 a
[4,] 12 b
[5,] 12 b
或者,
Y2[SJ(X2$y_id)] # binary merge of keyed i, see ?SJ
id b
[1,] 11 a
[2,] 12 b
[3,] 12 b
[4,] 14 d
[5,] 14 d
identical(Y2[J(X2$y_id)], Y2[X2$y_id])
[1] FALSE
这篇关于将外键上的 SQL 连接转换为 R data.table 语法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!