Understanding optimisation messages on assignment by reference in a data.table


Question

This comes from an observation I made while answering this question from @sds here.

First, let me create a data.table:

library(data.table)
options(datatable.verbose = TRUE)  # print data.table's internal optimisation messages
dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b=1:10, c=11:20, d=21:30, key="a")

Now, suppose one wants the sum of all columns grouped by column a. Then we could do:

dt.out <- dt[, lapply(.SD, sum), by = a]

Now, suppose I also want to add the number of entries in each group to dt.out. I normally assign it by reference as follows:

dt.out[, count := dt[, .N, by=a][, N]]
# or alternatively
dt.out[, count := dt[, .N, by=a][["N"]]]



During this assignment, one of the messages data.table produces is:

RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.

This message comes from assign.c in data.table's source directory. I don't want to paste the relevant snippet here as it's about 18 lines; if necessary, just leave a comment and I'll paste the code. dt[, .N, by=a][["N"]] just gives [1] 5 5, so it's not a named vector. And I don't understand what this "recycled list" in the RHS is.

But if I do:

dt.out[, `:=`(count = dt[, .N, by=a][, N])]
# or equivalently
dt.out[, `:=`(count = dt[, .N, by=a][["N"]])]

Then, I get the message:

Direct plonk of unnamed RHS, no copy.

As I understand it, the RHS has been duplicated in the first case, meaning a copy is being made (shallow or deep, I don't know which). If so, why is this happening?

Even if not, why does assignment by reference differ internally between the two forms? Any ideas?

Edit: I seem to have forgotten to show the main assignment I had in mind while writing this post, which is dt.out[, count := dt[, .N, by=a][["N"]]] (the second way).

Recommended answer

Update: The expression

DT[, c(lapply(.SD,sum),.N), by=a]

has been optimised internally in commit #1242 of v1.9.3 (FR #2722). Here's the entry from NEWS:

o The form DT[, c(..., lapply(.SD, fun), ...), by=grp] is now optimised, as long as .SD is only present in the form lapply(.SD, fun).
  For example: DT[, c(.I, lapply(.SD, sum), mean(x), lapply(.SD, log)), by=grp]
  is optimised to: DT[, list(.I, x=sum(x), y=sum(y), ..., mean(x), log(x), log(y), ...), by=grp]
  But DT[, c(.SD, lapply(.SD, sum)), by=grp], for example, isn't optimised yet. This partially resolves FR #2722. Thanks to Sam Steingold for filing the FR.

Where the message says NAMED vector, it means NAMED in the internal R sense at C level, i.e. whether an object has been assigned to a symbol and is called something, not whether an atomic vector has a "names" attribute. The NAMED value in the SEXP structure takes the value 0, 1 or 2. R uses it to decide whether it needs to copy on subassignment. See section 1.1.2 of R-ints.
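To make the distinction concrete, here is a minimal base-R sketch of the copy-on-subassign behaviour that NAMED exists to support. It does not read the NAMED field itself; it only shows the observable effect. The variable names x and y are illustrative and not from the original post.

```r
# A plain integer vector with no "names" attribute.
x <- c(5L, 5L)
has_names <- !is.null(names(x))   # FALSE: "NAMED" is not about this attribute

# Bind a second symbol to the same vector. Internally, R now knows the
# vector is reachable through more than one name.
y <- x

# Subassigning through y must not change x, so R copies the vector first.
y[1] <- 0L

print(x)   # x still holds its original values
print(y)   # y holds the modified copy
```

In this sense a vector that is bound to a symbol is "NAMED" even though names(x) is NULL.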

What would be better is if optimization of j in data.table could handle:

DT[, c(lapply(.SD,sum),.N), by=a]

That works but may be slow. Currently only the simpler form is optimized:

DT[, lapply(.SD,sum), by=a]

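As a rough sketch of the combined form discussed above, the following reuses the toy table from the question and compares the one-step call with the two-step approach. The names res and two_step are illustrative assumptions. The count column is read by position because the name data.table gives it may depend on the version.

```r
library(data.table)

# Same toy table as in the question.
dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# One grouped call: column sums plus the group size.
res <- dt[, c(lapply(.SD, sum), .N), by = a]

# Two-step version from the question, for comparison.
two_step <- dt[, lapply(.SD, sum), by = a]
two_step[, count := dt[, .N, by = a]$N]

# The last column of `res` holds the per-group counts.
print(res)
print(two_step)
```

Both approaches should give the same sums and counts. The single call avoids a second grouped query and the separate assignment.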
To answer the main question: yes, the following:

Direct plonk of unnamed RHS, no copy.

is desirable compared to:

RHS for item 1 has been duplicated. Either NAMED vector or recycled list RHS.

Another way to achieve it is:

dt.out[, count := dt[, .N, by=a]$N]


I'm not quite sure why [["N"]] returns a NAM(2) compared to $N which doesn't.
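For what it's worth, the three ways of extracting the counts return the same plain, unnamed integer vector at the R level, so any difference between them lies in R's internal bookkeeping rather than in the values. A small sketch, with illustrative variable names; the verbose message each form triggers may vary with the R and data.table versions, so none is asserted here:

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, key = "a")
counts <- dt[, .N, by = a]

via_dollar  <- counts$N        # list-style extraction with $
via_bracket <- counts[["N"]]   # list-style extraction with [[
via_j       <- counts[, N]     # extraction through data.table's j

# All three are identical in value and carry no "names" attribute.
same_values <- identical(via_dollar, via_bracket) && identical(via_dollar, via_j)
print(same_values)
print(names(via_dollar))
```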
