Including all permutations when using data.table[,,by=...]


Problem description


I have a large data.table that I am collapsing to the month level using ,by.

There are 5 by vars, with # of levels: c(4,3,106,3,1380). The 106 is months, the 1380 is a geographic unit. As it turns out, some cells have no values, so their counts should be 0. by drops these empty combinations, but I'd like it to keep them.
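To make the behavior concrete, here is a minimal sketch on a made-up toy table (`toy`, `counts`, and the column `v` are hypothetical names, not part of my real data). One of the four possible combinations never occurs, and grouping with by returns only the three that do:

```r
library(data.table)

# Toy data (hypothetical): the combination g1 = "b", g2 = "y" never occurs
toy <- data.table(g1 = c("a", "a", "b"), g2 = c("x", "y", "x"), v = 1:3)

# Grouping with by returns only the combinations present in the data
counts <- toy[, list(nv = .N), by = list(g1, g2)]

nrow(counts)                                       # observed combinations
length(unique(toy$g1)) * length(unique(toy$g2))    # all possible combinations
```

The gap between the two numbers is the set of empty cells I want reported with a count of 0.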

Reproducible example:

require(data.table)

set.seed(1)
n <- 1000
s <- function(n,l=5) sample(letters[seq(l)],n,replace=TRUE)
dat <- data.table( x=runif(n), g1=s(n), g2=s(n), g3=s(n,25) )
datCollapsed <- dat[ , list(nv=.N), by=list(g1,g2,g3) ]
datCollapsed[ , prod(dim(table(g1,g2,g3))) ] # how many there should be: 5*5*25=625
nrow(datCollapsed) # how many there are

Is there an efficient way to fill in these missing values with 0's, so that all permutations of the by vars are in the resultant collapsed data.table?

Solution

I'd also go with a cross-join, but would use it in the i-slot of the original call to [.data.table:

keycols <- c("g1", "g2", "g3")                              ## Grouping columns
setkeyv(dat, keycols)                                       ## Set dat's key
ii <- do.call(CJ, lapply(dat[,keycols,with=FALSE], unique)) ## CJ() to form index (lapply always returns a list)
datCollapsed <- dat[ii, list(nv=.N), by=.EACHI]             ## Aggregate (by=.EACHI is required from data.table 1.9.4 onward)

## Check that it worked
nrow(datCollapsed)
# [1] 625
table(datCollapsed$nv)
#   0   1   2   3   4   5   6 
# 135 191 162  82  39  13   3 
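For readers who prefer not to set a key, the same idea can probably be written as an ad hoc join. The following is a sketch, assuming a data.table version that supports the `on=` argument and `CJ(..., unique = TRUE)`; the arguments to `CJ()` are named explicitly so that the resulting columns are called g1, g2, g3 regardless of version.

```r
library(data.table)

set.seed(1)
n <- 1000
s <- function(n, l = 5) sample(letters[seq(l)], n, replace = TRUE)
dat <- data.table(x = runif(n), g1 = s(n), g2 = s(n), g3 = s(n, 25))

# Every combination of the observed levels of the three grouping columns
ii <- dat[, CJ(g1 = g1, g2 = g2, g3 = g3, unique = TRUE)]

# Join on the grouping columns and evaluate j once per row of ii;
# .N is 0 for combinations that have no matching rows in dat
datCollapsed <- dat[ii, list(nv = .N), on = c("g1", "g2", "g3"), by = .EACHI]

nrow(datCollapsed) == nrow(ii)   # one row per combination
sum(datCollapsed$nv) == n        # every original row is counted exactly once
```

Because the join is driven by the rows of `ii`, combinations absent from `dat` still produce a row, which is what distinguishes this from grouping with by alone.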

This approach is referred to as a "by-without-by" and, as documented in ?data.table, it is just as efficient and fast as passing the grouping instructions in via the by argument:

Advanced: Aggregation for a subset of known groups is particularly efficient when passing those groups in 'i'. When 'i' is a 'data.table', 'DT[i,j]' evaluates 'j' for each row of 'i'. We call this by without by or grouping by i. Hence, the self join 'DT[data.table(unique(colA)),j]' is identical to 'DT[,j,by=colA]'.
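The equivalence stated in that passage can be checked on a small made-up table. This is only a sketch: `DT`, `groups`, and `total` are illustrative names, and it assumes a data.table version in which grouping by each row of i is requested explicitly with `by = .EACHI` and the join columns are given with `on=`.

```r
library(data.table)

# Small illustrative table (hypothetical data)
DT <- data.table(colA = c("b", "a", "b", "c"), v = 1:4)

# Ordinary grouping via the by argument
byResult <- DT[, list(total = sum(v)), by = colA]

# The same aggregation driven by a table of the unique groups passed in i
groups <- data.table(colA = unique(DT$colA))
joinResult <- DT[groups, list(total = sum(v)), on = "colA", by = .EACHI]

identical(byResult$colA, joinResult$colA)     # same groups, same order
identical(byResult$total, joinResult$total)   # same aggregated values
```

The two forms differ only once `groups` contains values that do not occur in `DT`: the join form still returns a row for each of them.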
