.EACHI in data.table?

Views: 31 | Posted: 2021/12/8 11:33:56 | Tags: r performance group-by data.table

Question

I cannot seem to find any documentation on what exactly .EACHI does in data.table. I see a brief mention of it in the documentation:

Aggregation for a subset of known groups is particularly efficient when passing those groups in i and setting by=.EACHI. When i is a data.table, DT[i,j,by=.EACHI] evaluates j for the groups of DT that each row in i joins to. We call this grouping by each i.

But what does "groups" mean in the context of DT? Is a group determined by the key that is set on DT? Is a group every distinct row, using all the columns as the key? I fully understand how to run something like DT[i,j,by=my_grouping_variable], but I am confused as to how .EACHI would work. Could someone explain, please?

Recommended answer

I've added this to the list here. And hopefully we'll be able to deliver as planned.

The reason is most likely that by=.EACHI is a recent feature (since 1.9.4), but what it does isn't. Let me explain with an example. Suppose we have two data.tables X and Y:

library(data.table)

# X and Y are both keyed on x, so X[Y] joins on that column
X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")

We know that we can join by doing X[Y]. This is similar to a subset operation, but using data.tables (instead of integers / row names or logical values). For each row in Y, it takes Y's key columns, looks them up in X's key columns, and returns the matching rows of X (plus the columns of Y).

X[Y]
#    x y z
# 1: 2 4 b
# 2: 2 5 b
# 3: 6 7 a
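
To see which rows of X each row of Y picks out, the same join can be asked for row numbers instead of data. The sketch below is only an illustration: it assumes a data.table version that supports the on= argument (1.9.6 or later), and X2 is a hypothetical unkeyed copy of X, used to show that the join works on named columns even when no key is set.

```r
library(data.table)

X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")

# Row numbers of X matched by the rows of Y, in Y's row order
idx = X[Y, which = TRUE]
idx
# [1] 4 5 7

# Hypothetical unkeyed copy of X; the join column is named with on=
X2 = data.table(x = c(1,1,1,2,2,5,6), y = 1:7)
res = X2[Y, on = "x"]   # same three rows as X[Y]
```
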

Now suppose that, for each row of Y's key columns (here there is only one key column), we'd like to get the count of matches in X. In versions of data.table < 1.9.4, we could do this by simply specifying .N in j as follows:

# < 1.9.4
X[Y, .N]
#    x N
# 1: 2 2
# 2: 6 1

What this implicitly does is, in the presence of j, evaluate the j-expression on each matched result of X (corresponding to the row in Y). This was called by-without-by or implicit-by, because it's as if there's a hidden by.

The issue was that this would always perform a by operation. So, if we wanted to know the number of rows after a join, we'd have to do X[Y][, .N] (or simply nrow(X[Y]) in this case). That is, we couldn't have the j-expression in the same call if we didn't want a by-without-by. As a result, when we did for example X[Y, list(z)], it evaluated list(z) using by-without-by and was therefore slightly slower.
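
As a small sketch of that workaround on the example tables (re-created here so the snippet runs on its own; joined, n_chain and n_rows are names chosen only for this illustration): doing the join first and counting in a separate step gives the size of the join result rather than one count per row of Y.

```r
library(data.table)

X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")

joined  = X[Y]           # do the join first
n_chain = joined[, .N]   # then count rows in a second [ ] call
n_rows  = nrow(joined)   # equivalent here
n_chain
# [1] 3
```
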
Additionally, data.table users requested that this be made explicit - see this and this for more context.

Hence by=.EACHI was added. Now, when we do:

X[Y, .N]
# [1] 3

it does what it's meant to do (avoids confusion): it returns the number of rows resulting from the join.

And,

X[Y, .N, by=.EACHI]

evaluates the j-expression on the matching rows for each row in Y (corresponding to the value from Y's key columns here). It is easier to see this by using which=TRUE.

X[.(2), which=TRUE] # [1] 4 5
X[.(6), which=TRUE] # [1] 7

If we run .N for each, then we should get 2,1.

X[Y, .N, by=.EACHI]
#    x N
# 1: 2 2
# 2: 6 1
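
Putting the two forms side by side on the same tables may help. The sketch below also reproduces the by=.EACHI counts by looking up each row of Y by hand, and adds an illustrative key value, 3, that has no match in X. The names n_total, n_each, manual, sums and n_nomatch are hypothetical, chosen only for this example.

```r
library(data.table)

X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")

n_total = X[Y, .N]                # a single number: rows in the join result
n_each  = X[Y, .N, by = .EACHI]   # one row per row of Y

# Same counts by hand: look up each key value of Y separately
manual = sapply(Y$x, function(v) length(X[.(v), which = TRUE]))

# Other j-expressions are evaluated per group too, e.g. the sum of y
sums = X[Y, .(total_y = sum(y)), by = .EACHI]

# A key value with no match in X (3 here) still gets its own group
n_nomatch = X[.(c(2, 3, 6)), .N, by = .EACHI]
```

So the groups here come from the rows of i (Y in this example), not from a separate grouping column in X.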

So we now have both functionalities. Hope this helps.
