How to ungroup list columns in data.table?


Question

tidyr provides the unnest function, which helps expand list columns.

This is similar to the much (20x) faster ungroup function in kdb.

I am looking for a similar (but much faster) function that, given a data.table containing several list columns, each with the same number of elements on any given row, would expand the data.table.
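The assumption stated above (every list column holds the same number of elements on a given row) can be checked before expanding. A minimal base-R sketch; the helper name `same_lengths` and the sample data are made up for illustration and are not part of the original post:

```r
# Hypothetical helper: TRUE when all list columns have identical
# per-row element counts, FALSE otherwise.
same_lengths <- function(cols) {
  lens <- lapply(cols, lengths)   # one integer vector of counts per column
  all(vapply(lens, function(x) identical(x, lens[[1L]]), logical(1)))
}

# Two list columns with 2, 3 and 0 elements on rows 1 to 3
ok_cols  <- list(c = list(1:2, 1:3, integer(0)),
                 d = list(c(1, 1), c(20, 20, 20), numeric(0)))
# Same columns, but d has only two elements on row 2
bad_cols <- list(c = list(1:2, 1:3, integer(0)),
                 d = list(c(1, 1), c(20, 20), numeric(0)))

same_lengths(ok_cols)
same_lengths(bad_cols)
```

Since a data.table is a list of columns, a call such as `same_lengths(DT[, .(c, d)])` should work the same way on the table itself.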

This is an extension of this post.

library(data.table)
library(tidyr)
t = Sys.time()
DT = data.table(a=c(1,2,3),
                b=c('q','w','e'),
                c=list(rep(t,2),rep(t+1,3),rep(t,0)),
                d=list(rep(1,2),rep(20,3),rep(1,0)))

print(DT)
   a b                                                           c        d
1: 1 q                     2016-01-09 09:55:14,2016-01-09 09:55:14      1,1
2: 2 w 2016-01-09 09:55:15,2016-01-09 09:55:15,2016-01-09 09:55:15 20,20,20
3: 3 e                                                                     

print(unnest(DT))
Source: local data frame [5 x 4]

      a     b                   c     d
  (dbl) (chr)              (time) (dbl)
1     1     q 2016-01-09 09:55:14     1
2     1     q 2016-01-09 09:55:14     1
3     2     w 2016-01-09 09:55:15    20
4     2     w 2016-01-09 09:55:15    20
5     2     w 2016-01-09 09:55:15    20

Here is my own attempt. It seems to be about 2x quicker, but there should be plenty of room for improvement:

dtUngroup <- function(DT){
  colClasses <- lapply(DT,FUN=class)
  listCols <- colnames(DT)[colClasses=='list']
  if(length(listCols)>0){
    nonListCols <- setdiff(colnames(DT),listCols)
    nbListElem <- unlist(DT[,lapply(.SD,FUN=lengths),.SDcols=(listCols[1L])])
    DT1 <- DT[,lapply(.SD,FUN=rep,times=(nbListElem)),.SDcols=(nonListCols)]
    DT1[,(listCols):=DT[,lapply(.SD,FUN=function(x) do.call('c',x)),.SDcols=(listCols)]]
    return(DT1)
  }
  return(DT)
} 
dtUngroup(DT)[]
   a b                   c  d
1: 1 q 2016-01-09 09:55:14  1
2: 1 q 2016-01-09 09:55:14  1
3: 2 w 2016-01-09 09:55:15 20
4: 2 w 2016-01-09 09:55:15 20
5: 2 w 2016-01-09 09:55:15 20

Recommended answer

Using:

na.omit(DT[, lapply(.SD, unlist), a][, c := as.POSIXct(c, origin="1970-01-01")])

gives:

   a b                   c  d
1: 1 q 2016-01-09 12:17:24  1
2: 1 q 2016-01-09 12:17:24  1
3: 2 w 2016-01-09 12:17:25 20
4: 2 w 2016-01-09 12:17:25 20
5: 2 w 2016-01-09 12:17:25 20
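The `as.POSIXct` step in the answer is presumably needed because `unlist` drops the POSIXct class and returns the underlying number of seconds since 1970-01-01. A minimal base-R sketch of that behaviour; the object names and the fixed UTC timestamp are made up for illustration:

```r
# A list of POSIXct vectors, like column c in the question
t0 <- as.POSIXct("2016-01-09 09:55:14", tz = "UTC")
times <- list(rep(t0, 2), rep(t0 + 1, 3))

flat_num <- unlist(times)       # attributes are lost: plain numeric seconds
flat_ct  <- do.call(c, times)   # c() dispatches on POSIXct, class is kept

class(flat_num)
class(flat_ct)

# Restoring the class after unlist(), as the answer does
restored <- as.POSIXct(flat_num, origin = "1970-01-01", tz = "UTC")
```

The question's `dtUngroup` combines list elements with `do.call('c', x)`, which is why it needs no conversion step afterwards.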

When the values in column a are not unique across rows, you can use:

na.omit(DT[, lapply(.SD, unlist), by=1:nrow(DT)][, c := as.POSIXct(c, origin="1970-01-01")])
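To illustrate the row-wise grouping, here is a small sketch on hypothetical data in which the value 1 appears twice in column a. The table `DT2` and the explicit grouping name `row_id` are assumptions added for the example; the list column has no empty elements, so `na.omit` is left out:

```r
library(data.table)

# Hypothetical data: a is not unique, d is a list column
DT2 <- data.table(a = c(1, 1, 2),
                  b = c("q", "w", "e"),
                  d = list(c(1, 1), c(20, 20, 20), 5))

# Group by row number so each row is expanded on its own;
# length-1 columns (a, b) are recycled to the length of d within each group
res <- DT2[, lapply(.SD, unlist), by = .(row_id = seq_len(nrow(DT2)))]

# Drop the helper grouping column
res[, row_id := NULL]
print(res)
```

Naming the grouping expression makes it easy to remove the extra column afterwards, since grouping by a row index always adds one column to the result.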

Benchmark:

> microbenchmark(dtUngroup(DT)[], jaap())
Unit: milliseconds
            expr      min       lq     mean   median       uq      max neval cld
 dtUngroup(DT)[] 3.935677 4.005596 4.189208 4.066196 4.227372 6.750338   100   b
          jaap() 1.977175 2.039830 2.094536 2.074314 2.132525 2.309848   100  a 
