Proper/fastest way to reshape a data.table


Problem description

I have a data table in R:

library(data.table)
set.seed(1234)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
      x y  v
 [1,] 1 A 12
 [2,] 1 B 62
 [3,] 1 A 60
 [4,] 1 B 61
 [5,] 2 A 83
 [6,] 2 B 97
 [7,] 2 A  1
 [8,] 2 B 22
 [9,] 3 A 99
[10,] 3 B 47
[11,] 3 A 63
[12,] 3 B 49

I can easily sum the variable v by the groups in the data.table:

out <- DT[,list(SUM=sum(v)),by=list(x,y)]
out
     x  y SUM
[1,] 1 A  72
[2,] 1 B 123
[3,] 2 A  84
[4,] 2 B 119
[5,] 3 A 162
[6,] 3 B  96

However, I would like to have the groups (y) as columns, rather than rows. I can accomplish this using reshape:

out <- reshape(out,direction='wide',idvar='x', timevar='y')
out
     x SUM.A SUM.B
[1,] 1    72   123
[2,] 2    84   119
[3,] 3   162    96

Is there a more efficient way to reshape the data after aggregating it? Is there any way to combine these operations into one step, using the data.table operations?

Solution

The data.table package implements faster melt/dcast functions (in C). It also adds features of its own, such as melting and casting multiple columns at once. See the vignette "Efficient reshaping using data.tables" on GitHub.
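For the question's data, the aggregation and the reshape can be combined into a single dcast() call. The following is a minimal sketch rather than the answer author's own code; it assumes a data.table version that provides dcast() directly (see the version notes in this answer), and it writes the values of v out explicitly instead of calling sample(), since the random draw can differ between R versions.

```r
library(data.table)

# Same data as in the question, with v written out explicitly so the result
# does not depend on the random number generator.
DT <- data.table(x = rep(c(1, 2, 3), each = 4),
                 y = rep(c("A", "B"), times = 6),
                 v = c(12, 62, 60, 61, 83, 97, 1, 22, 99, 47, 63, 49))

# Aggregate and reshape in one call: one row per x, one column per level of y,
# each cell holding sum(v) for that (x, y) group.
wide <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = sum)
print(wide)
```

The result has columns x, A and B, holding the same sums as SUM.A and SUM.B in the reshape() output from the question.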

melt/dcast functions for data.table have been available since v1.9.0 and the features include:

  • There is no need to load the reshape2 package before casting. If you want reshape2 loaded for other operations, load it before loading data.table.

  • dcast is also an S3 generic. No more dcast.data.table(). Just use dcast().

  • melt:

    • is capable of melting on columns of type 'list'.

    • gains variable.factor and value.factor which by default are TRUE and FALSE respectively for compatibility with reshape2. This allows for directly controlling the output type of variable and value columns (as factors or not).

    • melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.

    • NEW: melt can accept a list for measure.vars, and the columns specified in each element of the list will be combined together. This is facilitated further through the use of patterns(). See vignette or ?melt.

  • dcast:

    • accepts multiple fun.aggregate and multiple value.var. See vignette or ?dcast.

    • can use the rowid() function directly in the formula to generate an id column, which is sometimes required to identify the rows uniquely. See ?dcast.

  • Old benchmarks:

    • melt : 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
    • dcast : 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
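The melt and dcast features listed above can be sketched roughly as follows. The families table and its column names are hypothetical and chosen only for illustration; the second half reuses the question's data with v written out explicitly. This is a sketch under those assumptions, not a definitive reference; ?melt and ?dcast remain the authority on the arguments.

```r
library(data.table)

# Hypothetical wide table (not from the original answer): a date of birth and
# a gender column for each of two children per family.
families <- data.table(
  family   = c(1, 2),
  dob_1    = c("1998-11-26", "1996-06-22"),
  dob_2    = c("2000-01-29", "1999-03-14"),
  gender_1 = c("M", "F"),
  gender_2 = c("F", "M")
)

# melt: patterns() supplies the measure.vars list, so the dob_* columns and
# the gender_* columns are melted into two value columns in a single call.
long <- melt(families,
             id.vars = "family",
             measure.vars = patterns("^dob_", "^gender_"),
             variable.name = "child",
             value.name = c("dob", "gender"))

# dcast: several value.var columns are cast back to wide form in one call.
back <- dcast(long, family ~ child, value.var = c("dob", "gender"))

# The question's data, with v written out explicitly.
DT <- data.table(x = rep(c(1, 2, 3), each = 4),
                 y = rep(c("A", "B"), times = 6),
                 v = c(12, 62, 60, 61, 83, 97, 1, 22, 99, 47, 63, 49))

# dcast: several aggregation functions passed as a list.
agg <- dcast(DT, x ~ y, value.var = "v", fun.aggregate = list(sum, mean))

# dcast: rowid() in the formula numbers repeated (x, y) pairs, so every value
# of v gets its own cell and no aggregation is needed.
per_row <- dcast(DT, x + rowid(x, y) ~ y, value.var = "v")
```

Here long has one row per family and child, back restores the original five columns, agg has one column per combination of function and level of y, and per_row has two rows for each x because every (x, y) pair occurs twice.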

Reminder from the Cologne (Dec 2013) presentation, slide 32: Why not submit a dcast pull request to reshape2?
