.EACHI in data.table?


Question

I cannot seem to find any documentation on what exactly .EACHI does in data.table. I see a brief mention of it in the documentation:

Aggregation for a subset of known groups is particularly efficient when passing those groups in i and setting by=.EACHI. When i is a data.table, DT[i,j,by=.EACHI] evaluates j for the groups of DT that each row in i joins to. We call this grouping by each i.

But what does "groups" in the context of DT mean? Is a group determined by the key that is set on DT? Is the group every distinct row that uses all the columns as the key? I fully understand how to run something like DT[i,j,by=my_grouping_variable] but am confused as to how .EACHI would work. Could someone explain please?

Solution

I've added this to the list here. And hopefully we'll be able to deliver as planned.


The documentation is most likely sparse because by=.EACHI is a recent feature (since 1.9.4), although the behaviour behind it is not new. Let me explain with an example. Suppose we have two data.tables X and Y:

library(data.table)
X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")

We know that we can join by doing X[Y]. This is similar to a subset operation, but using a data.table as the index (instead of integers, row names or logical values). For each row in Y, it takes the values of Y's key columns, finds the rows of X whose key columns match, and returns those rows together with the columns from Y.

X[Y]
#    x y z
# 1: 2 4 b
# 2: 2 5 b
# 3: 6 7 a
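
As a small sketch of the same join without relying on keys, the join column can be named explicitly with the on= argument (available since data.table 1.9.6). The names X2, Y2 and res are only illustrative and are not part of the example above.

```r
library(data.table)

# Same data as above, but without setting keys (X2 and Y2 are illustrative names).
X2 <- data.table(x = c(1, 1, 1, 2, 2, 5, 6), y = 1:7)
Y2 <- data.table(x = c(2, 6), z = letters[2:1])

# For each row of Y2, look up the rows of X2 with the same x and attach Y2's columns.
res <- X2[Y2, on = "x"]
print(res)
```

The result should contain the same three rows as the keyed X[Y] join.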

Now let's say that, for each row of Y's key columns (here there is only one key column), we'd like to get the count of matches in X. In versions of data.table < 1.9.4, we could do this by simply specifying .N in j as follows:

# < 1.9.4
X[Y, .N]
#    x N
# 1: 2 2
# 2: 6 1

What this implicitly did was, in the presence of j, evaluate the j-expression on each matched subset of X (one subset per row in Y). This was called by-without-by or implicit-by, because it's as if there's a hidden by.

The issue was that this would always perform a by operation. So, if we wanted to know the number of rows after a join, we'd have to do: X[Y][, .N] (or simply nrow(X[Y]) in this case). That is, we couldn't have the j-expression in the same call if we didn't want a by-without-by. As a result, when we did for example X[Y, list(z)], it evaluated list(z) using by-without-by and was therefore slightly slower.
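
To make the two-step workaround concrete, here is a minimal sketch that materialises the join first and counts its rows afterwards. The variable names joined, n_chained and n_nrow are only illustrative.

```r
library(data.table)

X <- data.table(x = c(1, 1, 1, 2, 2, 5, 6), y = 1:7, key = "x")
Y <- data.table(x = c(2, 6), z = letters[2:1], key = "x")

joined <- X[Y]              # materialise the join result first
n_chained <- joined[, .N]   # then count its rows in a second [ ] call
n_nrow <- nrow(joined)      # equivalent, using base R

print(c(n_chained, n_nrow))
```

Both counts refer to the whole join result, not to one group per row of Y.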

Additionally, data.table users requested that this be made explicit - see this and this for more context.

Hence by=.EACHI was added. Now, when we do:

X[Y, .N]
# [1] 3

it does what it's meant to do (avoids confusion). It returns the number of rows resulting from the join.

And,

X[Y, .N, by=.EACHI]

evaluates the j-expression on the rows of X that match each row in Y (identified here by the value in Y's key column). It'd be easier to see this by using which=TRUE.

X[.(2), which=TRUE] # [1] 4 5
X[.(6), which=TRUE] # [1] 7

If we run .N for each, then we should get 2,1.

X[Y, .N, by=.EACHI]
#    x N
# 1: 2 2
# 2: 6 1
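
The correspondence can be sketched by doing the per-row lookup by hand and comparing it with by=.EACHI. This is only an illustration: manual_counts, eachi_counts, eachi_sums and total_y are hypothetical names, and the last call just shows that j need not be .N.

```r
library(data.table)

X <- data.table(x = c(1, 1, 1, 2, 2, 5, 6), y = 1:7, key = "x")
Y <- data.table(x = c(2, 6), z = letters[2:1], key = "x")

# Manual version: look up each row of Y separately and count the matching rows of X.
manual_counts <- sapply(Y$x, function(v) length(X[.(v), which = TRUE]))

# by = .EACHI performs the same per-row grouping in a single call.
eachi_counts <- X[Y, .N, by = .EACHI]$N

# Any j-expression can be used, e.g. the sum of y over the rows matched by each row of Y.
eachi_sums <- X[Y, .(total_y = sum(y)), by = .EACHI]
print(eachi_sums)
```

The manual counts and the by=.EACHI counts should agree, since both are computed from the rows of X matched by each individual row of Y.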

So we now have both functionalities. Hope this helps.
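
To come back to the question of what a "group" is here: as far as I understand it, the groups are defined by the rows of i, not by the distinct key values of DT. The following sketch uses a hypothetical lookup table containing a repeated value and a value that is absent from X.

```r
library(data.table)

X <- data.table(x = c(1, 1, 1, 2, 2, 5, 6), y = 1:7, key = "x")

# i contains a repeated value (2) and a value that does not occur in X (3).
lookup <- data.table(x = c(2, 2, 3, 6))

# One group per row of lookup; j (.N) is evaluated on the rows of X that row joins to.
res <- X[lookup, .N, on = "x", by = .EACHI]
print(res)
```

The result should have one row per row of lookup: the repeated 2 is counted twice, and the unmatched 3 gets a count of 0.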
