How to group data.table by several columns consecutively


Question


I want to take a bunch of descriptive statistics grouped by several hundred grouping vars. I know from "How to group data.table by multiple columns?" that I can use list() in the grouping parameter if I want the stat for a combination of grouping vars. In my case I want the mean for each level of Y, and then the mean for each level of Z.

    # example data
    set.seed(007)
    DF <- data.frame(X = 1:50000,
                     Y = sample(c(0, 1), 50000, TRUE),
                     Z = sample(0:5, 50000, TRUE))

    library(data.table)
    DT <- data.table(DF)

    # I tried this - but it gives the mean for each combination of Y and Z
    DT[, mean(X), by = list(Y, Z)]

    # so does this
    DT[, mean(X), by = c("Y", "Z")]

    # This works...
    out <- lapply(c("Y", "Z"), function(K) DT[, mean(X), by = get(K)])
    out <- do.call(rbind, out)
    # ...but it is really slow.


I have 100 million records and 400+ grouping vars, so I need something reasonably efficient. The lapply option adds up to several days of extra processing time.

options(digits = 15)

# time the lapply approach
start.time <- Sys.time()
out <- lapply(c("Y", "Z"), function(K) DT[, mean(X), by = get(K)])
end.time <- Sys.time()
time.taken <- end.time - start.time

# time one grouped call per variable
start.time <- Sys.time()
DT[, mean(X), by = "Y"]
DT[, mean(X), by = "Z"]
end.time <- Sys.time()
time.taken2 <- end.time - start.time
time.taken - time.taken2


Answer


With development version 1.10.5, data.table has gained grouping set aggregation functions, which calculate aggregates at various levels of grouping, producing multiple (sub-)totals.

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2018-01-31 02:23:45 UTC

grp_vars <- setdiff(names(DF), "X")
groupingsets(setDT(DF), mean(X), by = grp_vars, sets = as.list(grp_vars))




    Y  Z       V1
1:  1 NA 24960.98
2:  0 NA 25039.96
3: NA  5 24652.44
4: NA  0 25006.61
5: NA  2 25223.83
6: NA  3 24959.26
7: NA  1 25095.58
8: NA  4 25068.84




Benchmark


# create data
n_rows = 1e6L
n_vars = 5
n_grps = 1e2L
set.seed(007) 
DT <- data.table(rn = seq_len(n_rows))
for (i in seq_len(n_vars)) set(DT, , paste0("X", i), i*rnorm(n_rows))
for (i in seq_len(n_grps)) set(DT, , paste0("Z", i), sample(0:i, n_rows, TRUE))

grps <- grep("^Z", names(DT), value = TRUE)
vars <- grep("^X", names(DT), value = TRUE)

# run benchmark
bm <- microbenchmark::microbenchmark(
  gs = {
    groupingsets(DT, lapply(.SD, mean), by = grps, sets = as.list(grps), .SDcols = vars)
  },
  lapply1 = {
    rbindlist(lapply(grps, function(K) DT[, lapply(.SD, mean), by = K, .SDcols = vars]), 
                fill = TRUE)
  },
  lapply2 = {
    out <- lapply(grps, function(K) DT[, lapply(.SD, mean), by = get(K), .SDcols = vars])
    do.call(rbind, out)
  },
  times = 3L
)
print(bm)



Even with 1 million rows and 100 grouping vars, there is no remarkable difference in the run times (groupingsets() seems to be a little slower than the other two approaches):


Unit: seconds
    expr      min       lq     mean   median       uq      max neval
      gs 3.602689 3.606646 3.608343 3.610603 3.611169 3.611735     3
 lapply1 3.524957 3.546060 3.561130 3.567163 3.579217 3.591270     3
 lapply2 3.562424 3.569284 3.577199 3.576144 3.584586 3.593027     3
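As a sanity check that the approaches agree (a self-contained sketch on a small simulated table; the object names gs and l1 are illustrative), the grouping set result can be compared against the per-variable lapply result. data.table's all.equal method can ignore row and column order, which differ between the two:

```r
library(data.table)

set.seed(007)
dt <- data.table(X1 = rnorm(1000),
                 Z1 = sample(0:1, 1000, TRUE),
                 Z2 = sample(0:2, 1000, TRUE))
grps <- c("Z1", "Z2")

# grouping set aggregation vs. one grouped call per variable
gs <- groupingsets(dt, lapply(.SD, mean), by = grps,
                   sets = as.list(grps), .SDcols = "X1")
l1 <- rbindlist(lapply(grps, function(K)
          dt[, lapply(.SD, mean), by = K, .SDcols = "X1"]),
        fill = TRUE)

# both contain the same rows, up to row and column ordering
all.equal(gs, l1, ignore.row.order = TRUE, ignore.col.order = TRUE)
```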

