当键列数不同时合并data.table [英] merge data.table when the number of key columns are different

查看:114
本文介绍了当键列数不同时合并data.table的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我试图从文档中了解 data.table 中的逻辑,有点不清楚。我知道我可以试试这个,看看会发生什么,但我想确保没有病理情况,因此想知道逻辑如何实际编码。当两个 data.table 对象具有不同数量的键列时,例如 a 有2和 b 有3,并且您运行 c < - a [b] ,将 a b 简单地合并到前两个键列,或者a中的第三列自动合并到 b中的第三个键列?示例:

I am attempting to understand the logic in the data.table from the documentation and a bit unclear. I know I can just try this and see what happens but I would like to make sure that there is no pathological case and therefore would like to know how the logic was actually coded. When two data.table objects have a different number of key columns, for example a has 2 and b has 3, and you run c <- a[b], will a and b be merged simply on the first two key columns or will the third column in a be automatically merged to the 3rd key column in b? An example:

require(data.table)
a <- data.table(id=1:10, t=1:20, v=1:40, key=c("id", "t"))
b <- data.table(id=1:10, v2=1:20, key="id")
c <- a[b]

b 中的 id 键列匹配的code> a 。例如,对于中的 id == 1 中有2行,并在 a 中生成4行,应在 c 中生成8行。这确实是什么似乎发生:

This should select rows of a that match the id key column in b. For example, for id==1 in b, there are 2 rows in b and 4 rows in a that should generate 8 rows in c. This is indeed what seems to happen:

> head(c,10)
    id  t  v v2
 1:  1  1  1  1
 2:  1  1 21  1
 3:  1 11 11  1
 4:  1 11 31  1
 5:  1  1  1 11
 6:  1  1 21 11
 7:  1 11 11 11
 8:  1 11 31 11
 9:  2  2  2  2
10:  2  2 22  2

另一种尝试方法是do:

The other way to try it is to do:

d <-b[a]

这应该做同样的事情:对于 a 中的每一行,应该选择 b :由于 a 有一个额外的键列, t 用于匹配和仅基于第一个键列的连接,应该完成 id 。似乎是这种情况:

This should do the same thing: for every row in a it should select the matching row in b: since a has an extra key column, t, that column should not be used for matching and a join based only on the first key column, id should be done. It seems like this is the case:

> head(d,10)
    id v2  t  v
 1:  1  1  1  1
 2:  1 11  1  1
 3:  1  1  1 21
 4:  1 11  1 21
 5:  1  1 11 11
 6:  1 11 11 11
 7:  1  1 11 31
 8:  1 11 11 31
 9:  2  2  2  2
10:  2 12  2  2

有人可以确认吗?要清楚:是在任何合并中使用的 a 的第三个键列,或者只有 data.table 使用两个表的 min(length(key(DT)))

Can someone confirm? To be clear: is the third key column of a ever used in any of the merges or does data.table only use the min(length(key(DT))) of the two tables.

推荐答案

好问题。首先正确的术语是(来自?data.table ):


data.table]可以具有一个或多个一个键。此键可用于行索引而不是rownames。

[A data.table] may have one key of one or more columns. This key can be used for row indexing instead of rownames.

所以key(单数)不是keys(复数)。我们可以脱离与钥匙,目前。但是当将来添加辅助键时,可以是多个键。每个键(单数)可以有多个(复数)。

So "key" (singlular) not "keys" (plural). We can get away with "keys", currently. But when secondary keys are added in future, there may then be multiple keys. Each key (singular) can have multiple columns (plural).

否则你是绝对正确的。以下段落在v1.8.2基础上得到改进,基于其他人的反馈也困惑。从?data.table

Otherwise you're absolutely correct. The following paragraph was improved in v1.8.2 based on feedback from others also confused. From ?data.table:


当i是data.table时,x必须有一个键。 i使用x的键连接到x,并返回匹配的x中的行。在x的键中的每个列之间执行等连接;即i的列1与x的键的第1列,第2列的第2列匹配等。匹配是O(log n)时间中编译的C中的二进制搜索。如果i的列少于x的键,那么x的许多行通常会匹配i的每一行,因为不是所有的x的键列都将被加入(常见的用例)。如果i具有比x的键多的列,则不涉及连接的列包括在结果中。如果我也有一个键,它是我的键列,用于匹配x的键列(i的键的列1连接到x的键的列1,列2到列2,等等)和二进制合并的两个表进行。在所有连接中,列的名称不相关。

When i is a data.table, x must have a key. i is joined to x using x's key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key; i.e., column 1 of i is matched to the 1st column of x's key, column 2 to the second, etc. The match is a binary search in compiled C in O(log n) time. If i has fewer columns than x's key then many rows of x will ordinarily match to each row of i since not all of x's key columns will be joined to (a common use case). If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns (column 1 of i's key is joined to column 1 of x's key, column 2 to column 2, and so on) and a binary merge of the two tables is carried out. In all joins the names of the columns are irrelevant. The columns of x's key are joined to in order, either from column 1 onwards of i when i is unkeyed, or from column 1 onwards of i's key.


$ b。x的键的列按顺序连接,从i的第1列开始,i未键,或从i键的第1列开始。 $ b




以下注释,在v1.8.3中(在R-Forge上)现在读取(粗体变化):


Following comments, in v1.8.3 (on R-Forge) this now reads (changes in bold) :


当i是一个data.table时,x必须有一个键。 i使用x的键连接到x,并返回匹配的x中的行。在x的键中的每个列之间执行等连接;即i的列1与x的键的第1列,第2列的第2列匹配等。匹配是O(log n)时间中编译的C中的二进制搜索。如果i的列少于x的键,那么不是所有的x的键列都将被连接到(一个常见的用例),并且x的许多行将(通常)与i的每一行匹配。如果i具有比x的键多的列,则不涉及连接的列包括在结果中。如果我也有一个键,它是我的键列,用于匹配x的键列(i的键的列1连接到x的键的列1,我的键的第2列到x键的列2 ,等等,只要较短的键),并且执行两个表的二进制合并。在所有联接中,列的名称是不相关的; x的键的列按顺序连接,当i未键入时,从i的第1列开始,或从i键的第1列开始。 在代码中,连接列的数量由min(length(key(x)),if(haskey(i))length(key(i))else ncol(i) / p>

When i is a data.table, x must have a key. i is joined to x using x's key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x's key; i.e., column 1 of i is matched to the 1st column of x's key, column 2 to the second, etc. The match is a binary search in compiled C in O(log n) time. If i has fewer columns than x's key then not all of x's key columns will be joined to (a common use case) and many rows of x will (ordinarily) match to each row of i. If i has more columns than x's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i's key columns that are used to match to x's key columns (column 1 of i's key is joined to column 1 of x's key, column 2 of i's key to column 2 of x's key, and so on for as long as the shorter key) and a binary merge of the two tables is carried out. In all joins the names of the columns are irrelevant; the columns of x's key are joined to in order, either from column 1 onwards of i when i is unkeyed, or from column 1 onwards of i's key. In code, the number of join columns is determined by min(length(key(x)),if (haskey(i)) length(key(i)) else ncol(i)).

这篇关于当键列数不同时合并data.table的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆