data.table join and j-expression unexpected behavior

Question

In R 2.15.0 with data.table 1.8.9:

d = data.table(a = 1:5, value = 2:6, key = "a")

d[J(3), value]
#   a value
#   3     4

d[J(3)][, value]
#   4

I expected both to produce the same output (the 2nd one) and I believe they should.

In the interest of clearing up that this is not a J syntax issue, same expectation applies to the following (identical to the above) expressions:

t = data.table(a = 3, key = "a")
d[t, value]
d[t][, value]

I would expect both of the above to return the exact same output.

So let me rephrase the question - why is (data.table designed so that) the key column printed out automatically in d[t, value]?

Update (based on answers and comments below): Thanks @Arun et al., I understand the design-why now. The reason the above prints the key is because there is a hidden by present every time you do a data.table merge via the X[Y] syntax, and that by is by the key. The reason it's designed this way seems to be the following - since the by operation has to be performed when merging, one might as well take advantage of that and not do another by if you are going to do that by the key of the merge.

Now that said, I believe that's a syntax design flaw. The way I read data.table syntax d[i, j, by = b] is

take d, apply the i operation (be that subsetting or merging or whatnot), and then do the j expression "by" b

The by-without-by breaks this reading and introduces cases one has to think about specifically (am I merging on i, is by just the key of the merge, etc). I believe this should be the job of the data.table - the commendable effort to make data.table faster in one particular case of the merge, when the by is equal to the key, should be done in an alternative way (e.g. by checking internally if the by expression is actually the key of the merge).
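
The two-step reading described above can be written out literally by chaining the calls, so that the join and the grouping are separate steps. This is only a sketch: the `lookup` table and the values are made up for illustration and do not come from the question.

```r
library(data.table)

# Made-up data for illustration; `lookup` is a hypothetical table to join on.
d <- data.table(a = c(1L, 1L, 2L, 2L, 3L), value = 2:6, key = "a")
lookup <- data.table(a = c(1L, 3L), key = "a")

# Step 1: apply the i operation (here, a join on the key column `a`).
joined <- d[lookup]

# Step 2: evaluate j on the joined rows, grouped by an explicit `by`.
result <- joined[, sum(value), by = a]
print(result)
```

Chaining costs an intermediate table, which is the overhead the implicit grouping was meant to avoid, but it keeps each step readable on its own.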

Recommended answer

As of data.table 1.9.3 (the development version, released to CRAN as 1.9.4), the default behavior has been changed and the two expressions from the question now produce the same result. To get the by-without-by result, one now has to specify an explicit by=.EACHI:

d = data.table(a = 1:5, value = 2:6, key = "a")

d[J(3), value]
#[1] 4

d[J(3), value, by = .EACHI]
#   a value
#1: 3     4
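
To check the difference in return type rather than rely on printed output, something like the following could be used. It is a sketch that assumes data.table 1.9.4 or later, and it uses the integer `3L` (an assumption, chosen to match the integer key column exactly) instead of `3`.

```r
library(data.table)

d <- data.table(a = 1:5, value = 2:6, key = "a")

# Default behavior after the change: j is evaluated once on the joined rows,
# so naming a single column returns a plain vector.
plain <- d[J(3L), value]

# Explicit by = .EACHI: j is evaluated once per row of i,
# and the join column is kept alongside the result.
each_i <- d[J(3L), value, by = .EACHI]

print(is.data.table(plain))
print(is.data.table(each_i))
```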

And here's a slightly more complicated example, illustrating the difference:

d = data.table(a = 1:2, b = 1:6, key = 'a')
#   a b
#1: 1 1
#2: 1 3
#3: 1 5
#4: 2 2
#5: 2 4
#6: 2 6

# normal join
d[J(c(1,2)), sum(b)]
#[1] 21

# join with a by-without-by, or by-each-i
d[J(c(1,2)), sum(b), by = .EACHI]
#   a V1
#1: 1  9
#2: 2 12

# and a more complicated example:
d[J(c(1,2,1)), sum(b), by = .EACHI]
#   a V1
#1: 1  9
#2: 2 12
#3: 1  9
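
One way to see that `by = .EACHI` is not simply a shortcut for joining and then grouping by the key: when `i` repeats a value, the two approaches give different results. This is a sketch that assumes data.table 1.9.4 or later. The table is rebuilt here with `rep()` and integer join values so that the block runs on its own.

```r
library(data.table)

# Same shape as the table above, with the recycling written out explicitly.
d <- data.table(a = rep(1:2, 3), b = 1:6, key = "a")

# One result row per row of i: the repeated key 1 is summed separately each time.
each_i <- d[J(c(1L, 2L, 1L)), sum(b), by = .EACHI]

# Join first, then group by `a`: the rows matched by both 1s fall into one group,
# so the rows for a == 1 are counted twice in that group's sum.
join_then_group <- d[J(c(1L, 2L, 1L))][, sum(b), by = a]

print(each_i)
print(join_then_group)
```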
