How to avoid an optimization warning in data.table
Problem description
I have the following code:
    > dt <- data.table(a=c(rep(3,5),rep(4,5)),b=1:10,c=11:20,d=21:30,key="a")
    > dt
        a  b  c  d
     1: 3  1 11 21
     2: 3  2 12 22
     3: 3  3 13 23
     4: 3  4 14 24
     5: 3  5 15 25
     6: 4  6 16 26
     7: 4  7 17 27
     8: 4  8 18 28
     9: 4  9 19 29
    10: 4 10 20 30
    > dt[,lapply(.SD,sum),by="a"]
    Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
    Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d))'
    Starting dogroups ... done dogroups in 0 secs
       a  b  c   d
    1: 3 15 65 115
    2: 4 40 90 140
    > dt[,c(count=.N,lapply(.SD,sum)),by="a"]
    Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
    Optimization is on but j left unchanged as 'c(count = .N, lapply(.SD, sum))'
    Starting dogroups ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
    done dogroups in 0 secs
       a count  b  c   d
    1: 3     5 15 65 115
    2: 4     5 40 90 140
How do I avoid the scary "very inefficient" warning?
I can add the count column before the join:

    > dt$count <- 1
    > dt
        a  b  c  d count
     1: 3  1 11 21     1
     2: 3  2 12 22     1
     3: 3  3 13 23     1
     4: 3  4 14 24     1
     5: 3  5 15 25     1
     6: 4  6 16 26     1
     7: 4  7 17 27     1
     8: 4  8 18 28     1
     9: 4  9 19 29     1
    10: 4 10 20 30     1
    > dt[,lapply(.SD,sum),by="a"]
    Finding groups (bysameorder=TRUE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
    Optimized j from 'lapply(.SD, sum)' to 'list(sum(b), sum(c), sum(d), sum(count))'
    Starting dogroups ... done dogroups in 0 secs
       a  b  c   d count
    1: 3 15 65 115     5
    2: 4 40 90 140     5
but this does not look too elegant...
Solution

One way I could think of is to assign count by reference:

    dt.out <- dt[, lapply(.SD,sum), by = a]
    dt.out[, count := dt[, .N, by=a][, N]]
    # alternatively: count := table(dt$a)

    #    a  b  c   d count
    # 1: 3 15 65 115     5
    # 2: 4 40 90 140     5
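The by-reference approach assigns the counts by position, so it relies on both grouped results listing the groups in the same order. A minimal self-contained sketch of that idea follows; the intermediate name `counts` and the explicit order check are additions for illustration, not part of the original answer:

```r
library(data.table)

# Same toy data as in the question.
dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Per-group sums and, separately, per-group row counts.
dt.out <- dt[, lapply(.SD, sum), by = a]
counts <- dt[, .N, by = a]  # `counts` is an illustrative name

# Both results group the same table by `a`, so the rows should line up;
# make that assumption explicit before assigning by position.
stopifnot(identical(dt.out$a, counts$a))

# Attach the counts by reference (no copy of dt.out is made).
dt.out[, count := counts$N]

print(dt.out)
```

If the two results could ever be built from differently ordered or filtered data, joining on `a` would be a safer choice than positional assignment.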
Edit 1: I still think it's just a message and not a warning. But if you still want to avoid it, just do:

    dt.out[, count := as.numeric(dt[, .N, by=a][, N])]
Edit 2: Very interesting. Doing the equivalent of multiple := assignment does not produce the same message.

    dt.out[, `:=`(count = dt[, .N, by=a][, N])]
    # Detected that j uses these columns: a
    # Finding groups (bysameorder=TRUE) ... done in 0.001secs. bysameorder=TRUE and o__ is length 0
    # Detected that j uses these columns: <none>
    # Optimization is on but j left unchanged as '.N'
    # Starting dogroups ... done dogroups in 0 secs
    # Detected that j uses these columns: N
    # Assigning to all 2 rows
    # Direct plonk of unnamed RHS, no copy.
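The "Finding groups ..." and "Starting dogroups ..." lines in the transcripts are data.table's verbose diagnostics. The sketch below assumes verbose mode was enabled through the `datatable.verbose` option (the question does not show how it was turned on). It runs the original one-line query with `verbose = FALSE` for that call and records whether any warning condition is actually signalled; the variable name `warned` is illustrative:

```r
library(data.table)

dt <- data.table(a = c(rep(3, 5), rep(4, 5)), b = 1:10, c = 11:20, d = 21:30, key = "a")

# Assumption: verbose mode is on globally, as the transcripts suggest.
options(datatable.verbose = TRUE)

# Pass verbose = FALSE for this call only, and record any warning raised.
warned <- FALSE
res <- withCallingHandlers(
  dt[, c(count = .N, lapply(.SD, sum)), by = a, verbose = FALSE],
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

# Restore the default so later calls are quiet.
options(datatable.verbose = FALSE)

print(res)
print(warned)
```

Whether the "very inefficient" note appears at all may depend on the data.table version in use, so treat this as a way to check the behaviour locally rather than a guaranteed result.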