.EACHI in data.table


Question

I cannot seem to find any documentation on what exactly .EACHI does in data.table. I see a brief mention of it in the documentation:

Aggregation for a subset of known groups is particularly efficient when passing those groups in i and setting by=.EACHI. When i is a data.table, DT[i,j,by=.EACHI] evaluates j for the groups of DT that each row in i joins to. We call this grouping by each i.

But what does "groups" in the context of DT mean? Is a group determined by the key that is set on DT? Is the group every distinct row that uses all the columns as the key? I fully understand how to run something like DT[i,j,by=my_grouping_variable] but am confused as to how .EACHI would work. Could someone explain please?

Solution

I've added this to the list here. And hopefully we'll be able to deliver as planned.


The reason is most likely that by=.EACHI is a recent feature (since 1.9.4), but what it does isn't. Let me explain with an example. Suppose we have two data.tables X and Y:

library(data.table)
X = data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y = data.table(x = c(2,6), z = letters[2:1], key = "x")

We know that we can join by doing X[Y]. This is similar to a subset operation, but using data.tables (instead of integers / row names or logical values). For each row in Y, it takes the values in Y's key columns, finds the matching rows in X by X's key columns, and returns them (together with the other columns from Y).

X[Y]
#    x y z
# 1: 2 4 b
# 2: 2 5 b
# 3: 6 7 a
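
To make the "for each row in Y" part concrete, here is a rough sketch that performs the join one row of Y at a time and stacks the pieces. The names per_row and stacked are just illustrative, and this is only meant to show the idea, not how data.table implements the join internally.

```r
library(data.table)

X <- data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y <- data.table(x = c(2,6), z = letters[2:1], key = "x")

# Join X against each single row of Y separately ...
per_row <- lapply(seq_len(nrow(Y)), function(i) X[Y[i]])

# ... and stack the per-row results on top of each other.
stacked <- rbindlist(per_row)

# The stacked result has the same rows as the one-shot join X[Y].
print(stacked)
print(X[Y])
```

Each element of per_row is the set of rows of X matched by one row of Y; those per-row sets are exactly the "groups" that by=.EACHI refers to further down.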

Now let's say that, for each row of Y's key columns (here only one key column), we'd like to get the count of matches in X. In versions of data.table < 1.9.4, we could do this by simply specifying .N in j as follows:

# < 1.9.4
X[Y, .N]
#    x N
# 1: 2 2
# 2: 6 1

What this implicitly does is, in the presence of j, evaluate the j-expression on each matched result of X (corresponding to the row in Y). This was called by-without-by or implicit-by, because it's as if there's a hidden by.

The issue was that this would always perform a by operation. So, if we wanted to know the number of rows after a join, we'd have to do X[Y][, .N] (or simply nrow(X[Y]) in this case). That is, we couldn't have the j expression in the same call if we didn't want a by-without-by. As a result, when we did, for example, X[Y, list(z)], it evaluated list(z) using by-without-by and was therefore slightly slower.
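
As a small sketch of that two-step workaround (run here on a current data.table version; the variable names joined, n_chained and n_nrow are only for illustration):

```r
library(data.table)

X <- data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y <- data.table(x = c(2,6), z = letters[2:1], key = "x")

# Step 1: materialise the join result.
joined <- X[Y]

# Step 2: count its rows, either with a second [ call or with nrow().
n_chained <- joined[, .N]
n_nrow    <- nrow(joined)

print(n_chained)
print(n_nrow)
```

Both ways count the rows of the already joined table, so neither involves any per-row grouping.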

Additionally, data.table users requested this to be explicit; see this and this for more context.

Hence by=.EACHI was added. Now, when we do:

X[Y, .N]
# [1] 3

it does what it's meant to do (avoids confusion). It returns the number of rows resulting from the join.

And,

X[Y, .N, by=.EACHI]

evaluates j-expression on the matching rows for each row in Y (corresponding to value from Y's key columns here). It'd be easier to see this by using which=TRUE.

X[.(2), which=TRUE] # [1] 4 5
X[.(6), which=TRUE] # [1] 7
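
The same idea can be sketched as a loop over the key values in Y, counting the matched row numbers for each one. This is only an illustration of what the grouping means, not what data.table does internally, and it assumes every value in Y has at least one match in X (an unmatched value returns a single NA from which=TRUE, which length() would count as 1).

```r
library(data.table)

X <- data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y <- data.table(x = c(2,6), z = letters[2:1], key = "x")

# For each key value in Y, look up the matching row numbers in X
# and count them. 'counts' is an illustrative name.
counts <- sapply(Y$x, function(v) length(X[.(v), which = TRUE]))

print(counts)
```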

If we run .N for each, then we should get 2,1.

X[Y, .N, by=.EACHI]
#    x N
# 1: 2 2
# 2: 6 1
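
The j-expression doesn't have to be .N. As a rough sketch (the column name total is just for illustration), here is a sum over the matched rows for each row of Y, next to the equivalent "join first, then group" version:

```r
library(data.table)

X <- data.table(x = c(1,1,1,2,2,5,6), y = 1:7, key = "x")
Y <- data.table(x = c(2,6), z = letters[2:1], key = "x")

# Evaluate j once per row of Y, on the rows of X that row matches.
by_eachi <- X[Y, .(total = sum(y)), by = .EACHI]

# Same numbers in two steps: join, then group the result by x.
after_join <- X[Y][, .(total = sum(y)), by = x]

print(by_eachi)
print(after_join)
```

So the "groups" in DT[i, j, by=.EACHI] are not defined by DT's key alone: each row of i defines one group, namely the rows of DT that it joins to. The two-step version gives the same totals here because the key values in Y are unique.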

So we now have both functionalities. Hope this helps.
