.EACHI in data.table
Question
I cannot seem to find any documentation on what exactly `.EACHI` does in `data.table`. I see a brief mention of it in the documentation:

> Aggregation for a subset of known groups is particularly efficient when passing those groups in `i` and setting `by=.EACHI`. When `i` is a data.table, `DT[i,j,by=.EACHI]` evaluates `j` for the groups of `DT` that each row in `i` joins to. We call this grouping by each i.

But what does "groups" in the context of `DT` mean? Is a group determined by the key that is set on `DT`? Is the group every distinct row that uses all the columns as the key? I fully understand how to run something like `DT[i,j,by=my_grouping_variable]` but am confused as to how `.EACHI` would work. Could someone explain please?

Answer

I've added this to the list here. And hopefully we'll be able to deliver as planned.
The reason it is sparsely documented is most likely that `by=.EACHI` is a recent feature (since 1.9.4), but what it does isn't. Let me explain with an example. Suppose we have two data.tables `X` and `Y`:

    X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
    Y = data.table(x = c(2,6), z = letters[2:1], key = "x")
We know that we can join by doing `X[Y]`. This is similar to a subset operation, but using data.tables (instead of integers / row names or logical values). For each row in `Y`, taking `Y`'s key columns, it finds and returns the corresponding matching rows in `X`'s key columns (+ columns in `Y`).

    X[Y]
    #    x y z
    # 1: 2 4 b
    # 2: 2 5 b
    # 3: 6 7 a
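As a side note, the same join can be sketched without setting keys. This assumes data.table 1.9.6 or later, where the `on=` argument is available; the variable name `joined` is only illustrative.

```r
library(data.table)

# Same data as above, but without keys (assumption: data.table >= 1.9.6).
X <- data.table(x = c(1, 1, 1, 2, 2, 5, 6), y = 1:7)
Y <- data.table(x = c(2, 6), z = letters[2:1])

# `on` names the column(s) to match on, so no key is needed.
joined <- X[Y, on = "x"]
print(joined)
```

The result should have the same three rows as the keyed `X[Y]` above.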
Now let's say that, for each row from `Y`'s key columns (here only one key column), we'd like to get the count of matches in `X`. In versions of `data.table` < 1.9.4, we could do this by simply specifying `.N` in `j` as follows:

    # < 1.9.4
    X[Y, .N]
    #    x N
    # 1: 2 2
    # 2: 6 1
What this implicitly does is, in the presence of `j`, evaluate the j-expression on each matched result of `X` (corresponding to the row in `Y`). This was called by-without-by or implicit-by, because it's as if there's a hidden by.

The issue was that this would always perform a `by` operation. So, if we wanted to know the number of rows after a join, then we'd have to do `X[Y][, .N]` (or simply `nrow(X[Y])` in this case). That is, we couldn't have the `j` expression in the same call if we didn't want a by-without-by. As a result, when we did for example `X[Y, list(z)]`, it evaluated `list(z)` using by-without-by and was therefore slightly slower.

Additionally, `data.table` users requested this to be explicit - see this and this for more context.

Hence `by=.EACHI` was added. Now, when we do:

    X[Y, .N]
    # [1] 3

it does what it's meant to do (avoids confusion). It returns the number of rows resulting from the join.
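To check the current behaviour yourself, here is a minimal sketch, assuming data.table 1.9.4 or later. The tables are repeated so the snippet runs on its own, and the names `n_joined` and `z_only` are illustrative.

```r
library(data.table)

X <- data.table(x = c(1, 1, 1, 2, 2, 5, 6), y = 1:7, key = "x")
Y <- data.table(x = c(2, 6), z = letters[2:1], key = "x")

# Without by = .EACHI, j is evaluated once on the whole join result.
n_joined <- X[Y, .N]
stopifnot(n_joined == nrow(X[Y]))

# Likewise, selecting a column in j no longer triggers an implicit by.
z_only <- X[Y, list(z)]
print(z_only)
```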
And,

    X[Y, .N, by=.EACHI]

evaluates the j-expression on the matching rows for each row in `Y` (corresponding to the value from `Y`'s key columns here). It'd be easier to see this by using `which=TRUE`.

    X[.(2), which=TRUE]
    # [1] 4 5
    X[.(6), which=TRUE]
    # [1] 7

If we run `.N` for each, then we should get 2,1.

    X[Y, .N, by=.EACHI]
    #    x N
    # 1: 2 2
    # 2: 6 1
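One way to convince yourself of this is to do the grouping by hand. The following is a minimal sketch, assuming data.table 1.9.4 or later: it looks up each row of `Y` separately and compares the counts with the `by=.EACHI` result. The names `manual_counts`, `eachi_counts` and `totals` are illustrative, and the per-row loop is for demonstration only, not something you'd use in practice.

```r
library(data.table)

X <- data.table(x = c(1, 1, 1, 2, 2, 5, 6), y = 1:7, key = "x")
Y <- data.table(x = c(2, 6), z = letters[2:1], key = "x")

# By hand: join one key value of Y at a time and count the matched rows of X.
manual_counts <- vapply(Y$x, function(v) X[.(v), .N], integer(1))

# In one call: j is evaluated once per row of Y, on the rows of X it joins to.
eachi_counts <- X[Y, .N, by = .EACHI]
stopifnot(identical(manual_counts, eachi_counts$N))

# Any j-expression works the same way, e.g. summing y within each match.
totals <- X[Y, .(total_y = sum(y)), by = .EACHI]
print(totals)
```

So the "groups" are not defined by the key of `X` alone: each row of `i` (here `Y`) defines one group, made up of the rows of `X` it joins to.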
So we now have both functionalities. Hope this helps.