How to group data.table by several columns consecutively


Question


I want to take a bunch of descriptive statistics grouped by several hundred grouping vars. I know from "How to group data.table by multiple columns?" that I can use list() in the grouping parameter if I want the stat for a combination of grouping vars. In my case I want the mean for each level of Y, and then the mean for each level of Z.

    # example data
    set.seed(007)
    DF <- data.frame(X = 1:50000,
                     Y = sample(c(0, 1), 50000, TRUE),
                     Z = sample(0:5, 50000, TRUE))

    library(data.table)
    DT <- data.table(DF)

    # I tried this - but it gives the mean for each combination of Y and Z
    DT[, mean(X), by = list(Y, Z)]

    # so does this
    DT[, mean(X), by = c("Y", "Z")]

    # This works...
    out <- lapply(c("Y", "Z"), function(K) DT[, mean(X), by = get(K)])
    out <- do.call(rbind, out)
    # ...but it is really slow.


I have 100 million records and 400+ grouping vars, so I need something reasonably efficient. The lapply option adds up to several days of extra processing time.

options(digits = 15)

# time the lapply approach
start.time <- Sys.time()
out <- lapply(c("Y", "Z"), function(K) DT[, mean(X), by = get(K)])
end.time <- Sys.time()
time.taken <- end.time - start.time

# time one grouped call per variable
start.time <- Sys.time()
DT[, mean(X), by = "Y"]
DT[, mean(X), by = "Z"]
end.time <- Sys.time()
time.taken2 <- end.time - start.time
time.taken - time.taken2


Answer


With development version 1.10.5, data.table has gained grouping set aggregation functions, which calculate aggregates at various levels of grouping, producing multiple (sub-)totals.

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2018-01-31 02:23:45 UTC

grp_vars <- setdiff(names(DF), "X")
groupingsets(setDT(DF), mean(X), by = grp_vars, sets = as.list(grp_vars))




    Y  Z       V1
1:  1 NA 24960.98
2:  0 NA 25039.96
3: NA  5 24652.44
4: NA  0 25006.61
5: NA  2 25223.83
6: NA  3 24959.26
7: NA  1 25095.58
8: NA  4 25068.84




Benchmark


# create data
n_rows = 1e6L
n_vars = 5
n_grps = 1e2L
set.seed(007) 
DT <- data.table(rn = seq_len(n_rows))
for (i in seq_len(n_vars)) set(DT, , paste0("X", i), i*rnorm(n_rows))
for (i in seq_len(n_grps)) set(DT, , paste0("Z", i), sample(0:i, n_rows, TRUE))

grps <- grep("^Z", names(DT), value = TRUE)
vars <- grep("^X", names(DT), value = TRUE)

# run benchmark
bm <- microbenchmark::microbenchmark(
  gs = {
    groupingsets(DT, lapply(.SD, mean), by = grps, sets = as.list(grps), .SDcols = vars)
  },
  lapply1 = {
    rbindlist(lapply(grps, function(K) DT[, lapply(.SD, mean), by = K, .SDcols = vars]), 
                fill = TRUE)
  },
  lapply2 = {
    out <- lapply(grps, function(K) DT[, lapply(.SD, mean), by = get(K), .SDcols = vars])
    do.call(rbind, out)
  },
  times = 3L
)
print(bm)



Even with 1 million rows and 100 grouping vars, there is no remarkable difference in the run times (groupingsets() seems to be a little slower than the other two approaches):


Unit: seconds
    expr      min       lq     mean   median       uq      max neval
      gs 3.602689 3.606646 3.608343 3.610603 3.611169 3.611735     3
 lapply1 3.524957 3.546060 3.561130 3.567163 3.579217 3.591270     3
 lapply2 3.562424 3.569284 3.577199 3.576144 3.584586 3.593027     3
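As a sanity check that the approaches agree (a self-contained sketch on a small simulated table; the object names gs and l1 are illustrative), the grouping set result can be compared against the per-variable lapply result. data.table's all.equal method can ignore row and column order, which differ between the two:

```r
library(data.table)

set.seed(007)
dt <- data.table(X1 = rnorm(1000),
                 Z1 = sample(0:1, 1000, TRUE),
                 Z2 = sample(0:2, 1000, TRUE))
grps <- c("Z1", "Z2")

# grouping set aggregation vs. one grouped call per variable
gs <- groupingsets(dt, lapply(.SD, mean), by = grps,
                   sets = as.list(grps), .SDcols = "X1")
l1 <- rbindlist(lapply(grps, function(K)
          dt[, lapply(.SD, mean), by = K, .SDcols = "X1"]),
        fill = TRUE)

# both contain the same rows, up to row and column ordering
all.equal(gs, l1, ignore.row.order = TRUE, ignore.col.order = TRUE)
```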

