About GForce in data.table 1.9.2

Problem description

I don't know how to take full advantage of GForce in data.table 1.9.2.

New optimization: GForce. Rather than grouping the data, the group locations are passed into grouped versions of sum and mean (gsum and gmean) which then compute the result for all groups in a single sequential pass through the column for cache efficiency. Further, since the g* function is called just once, we don't need to find ways to speed up calling sum or mean repetitively for each group.

When I run the following code

DT <- data.table(A=c(NA,NA,1:3), B=c("a",NA,letters[1:3]))
DT[,sum(A,na.rm=TRUE),by= B]

I get this

    B V1
1:  a  1
2: NA  0
3:  b  2
4:  c  3

and when I try DT[,sum(A,na.rm=FALSE),by= B], I get

    B V1
1:  a NA
2: NA NA
3:  b  2
4:  c  3

Do these results show what GForce does? Is it about adding the na.rm = TRUE/FALSE option?

Thanks a lot!

Solution

It has nothing to do with na.rm. What you show worked fine before 1.9.2 as well. However, I can see why you might have thought that. Here is the rest of the same NEWS item :

Examples where GForce applies now :
    DT[,sum(x,na.rm=),by=...]                       # yes
    DT[,list(sum(x,na.rm=),mean(y,na.rm=)),by=...]  # yes
    DT[,lapply(.SD,sum,na.rm=),by=...]              # yes
    DT[,list(sum(x),min(y)),by=...]                 # no. gmin not yet available
GForce is a level 2 optimization. To turn it off: options(datatable.optimize=1)
Reminder: to see the optimizations and other info, set verbose=TRUE

You don't need to do anything to benefit; it's an automatic optimization.
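If you want to check this on your own query, a minimal sketch along the following lines should do. It reuses the toy data from the question and passes verbose=TRUE directly to the query. The names res_sum and res_mix are only for illustration, and the wording of the verbose messages differs between data.table versions, so read them rather than rely on any exact text.

```r
library(data.table)

# Toy data from the question
DT <- data.table(A = c(NA, NA, 1:3), B = c("a", NA, letters[1:3]))

# verbose=TRUE prints diagnostics for this one query, including whether
# j was optimized (message wording varies by version).
res_sum <- DT[, sum(A, na.rm = TRUE), by = B, verbose = TRUE]

# A call the 1.9.2 NEWS item lists as not optimized because of min();
# later releases may treat it differently.
res_mix <- DT[, list(sum(A), min(A)), by = B, verbose = TRUE]
```

Comparing the two sets of messages is usually enough to tell whether the optimization was applied to a given j expression.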

Here's an example on 500 million rows and 4 columns (13GB). First, create and inspect the data :

$ R
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> require(data.table)
Loading required package: data.table
data.table 1.9.2  For help type: help("data.table")

> DT = data.table( grp = sample(1e6,5e8,replace=TRUE), 
                   a = rnorm(1e6),
                   b = rnorm(1e6),
                   c = rnorm(1e6))
> tables()
     NAME        NROW    MB COLS      KEY
[1,] DT   500,000,000 13352 grp,a,b,c    
Total: 13,352MB
> print(DT)
          grp          a            b          c
1e+00: 695059 -1.4055192  1.587540028  1.7104991
2e+00: 915263 -0.8239298 -0.513575696 -0.3429516
3e+00: 139937 -0.2202024  0.971816721  1.0597421
4e+00: 651525  1.0026858 -1.157824780  0.3100616
5e+00: 438180  1.1074729 -2.513939427  0.8357155
   ---                                          
5e+08: 705823 -1.4773420  0.004369457 -0.2867529
5e+08: 716694 -0.6826147 -0.357086020 -0.4044164
5e+08: 217509  0.4939808 -0.012797093 -1.1084564
5e+08: 501760  1.7081212 -1.772721799 -0.7119432
5e+08: 765653 -1.1141456 -1.569578263  0.4947304

Now time it with GForce optimization on (the default). Notice that there is no setkey first. This is what's known as a "cold by" or "ad hoc by", which is common practice when you want to group in lots of different ways.

> system.time(ans1 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed 
 47.520   5.651  53.173 
> system.time(ans1 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed 
 47.372   5.676  53.049      # immediate repeat to confirm timing

Now turn off GForce optimization (as per NEWS item) to see the difference it makes :

> options(datatable.optimize=1)

> system.time(ans2 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed 
 97.274   3.383 100.659 
> system.time(ans2 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed 
 97.199   3.423 100.624      # immediate repeat to confirm timing
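To picture where the difference comes from, here is a toy sketch in plain R of the two strategies the NEWS item contrasts. It is only an illustration under simplifying assumptions: data.table's real gsum is implemented in C, and the variable names below (x, grp, gid, acc) are made up for the example.

```r
# Illustrative only: plain R, not data.table's implementation.
x   <- c(10, 20, 30, 40, 50)
grp <- c("a", "b", "a", "c", "b")

# Strategy 1: split the column by group, then call sum() once per group.
per_group <- vapply(split(x, grp), sum, numeric(1))

# Strategy 2: map every row to a group index, then walk the column once,
# adding each value into its group's accumulator.
groups <- unique(grp)
gid    <- match(grp, groups)
acc    <- numeric(length(groups))
for (i in seq_along(x)) {
  acc[gid[i]] <- acc[gid[i]] + x[i]
}
names(acc) <- groups

# Both strategies give the same per-group sums.
stopifnot(isTRUE(all.equal(per_group[groups], acc)))
```

The second strategy touches the column sequentially and makes no per-group function calls, which is the idea the NEWS item describes.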

Finally, confirm the results are the same :

> identical(ans1,ans2)
[1] TRUE
> print(ans1)
            grp          a          b          c
      1: 695059  16.791281  13.269647 -10.663118
      2: 915263  43.312584 -33.587933   4.490842
      3: 139937   3.967393 -10.386636  -3.766019
      4: 651525  -4.152362   9.339594   7.740136
      5: 438180   4.725874  26.328877   9.063309
     ---                                        
 999996: 372601  -2.087248 -19.936420  21.172860
 999997:  13912  18.414226  -1.744378  -7.951381
 999998: 150074  -4.031619   8.433173 -22.041731
 999999: 385718  11.527876   6.807802   7.405016
1000000: 906246 -13.857315 -23.702011   6.605254

Notice that data.table retains the order of the groups according to when they first appeared. To order the grouped result, use keyby= instead of by=.
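A small sketch of the difference, on made-up data (the column names grp and x and the result names are just for illustration):

```r
library(data.table)

DT <- data.table(grp = c("c", "a", "b", "a", "c"),
                 x   = c(10, 1, 2, 3, 20))

# by= keeps groups in order of first appearance: c, a, b
by_first_seen <- DT[, sum(x), by = grp]

# keyby= returns groups sorted and sets grp as the key: a, b, c
by_sorted <- DT[, sum(x), keyby = grp]

print(by_first_seen)
print(by_sorted)
```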

To turn GForce optimization back on (default is Inf to benefit from all optimizations) :

> options(datatable.optimize=Inf)
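If you'd like to repeat the comparison without 13GB of RAM, something like the following scaled-down sketch may help. The sizes are arbitrary, and with_optimize is a hypothetical helper written for this example (it is not part of data.table); it sets the option, runs the query, and restores the previous setting so you can't forget to switch the optimization back on.

```r
library(data.table)

set.seed(1)
n  <- 1e6   # far smaller than the 5e8 rows above, chosen to run quickly
DT <- data.table(grp = sample(1e4, n, replace = TRUE),
                 a = rnorm(n), b = rnorm(n), c = rnorm(n))

# Hypothetical helper: evaluate a query under a given optimization level
# and restore the previous level afterwards.
with_optimize <- function(level, expr) {
  old <- options(datatable.optimize = level)
  on.exit(options(old), add = TRUE)
  expr   # lazily evaluated here, while the option is in effect
}

t_on  <- system.time(ans_on  <- with_optimize(Inf, DT[, lapply(.SD, sum), by = grp]))
t_off <- system.time(ans_off <- with_optimize(1,   DT[, lapply(.SD, sum), by = grp]))

# Elapsed seconds for each setting; absolute numbers depend on the machine
print(c(gforce_on = t_on[["elapsed"]], gforce_off = t_off[["elapsed"]]))

# all.equal() rather than identical(), in case summation order differs
print(all.equal(ans_on, ans_off))
```

On data this small the gap may be much less visible than in the timings above, so treat it as a way to try the switch rather than as a benchmark.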

Aside : if you're not familiar with the lapply(.SD,...) syntax, it's just a way to apply a function to each column, by group. For example, these two lines are equivalent :

 DT[, lapply(.SD,sum), by=grp]               # (1)
 DT[, list(sum(a),sum(b),sum(c)), by=grp]    # (2) exactly the same

The first (1) becomes more useful the more columns you have, especially in combination with .SDcols to control which subset of columns the function is applied to.
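For instance, a minimal sketch of .SDcols on a small made-up table (the data and the name res are only for illustration):

```r
library(data.table)

DT <- data.table(grp = c(1L, 1L, 2L, 2L),
                 a = c(1, 2, 3, 4),
                 b = c(10, 20, 30, 40),
                 c = c(100, 200, 300, 400))

# Sum only columns a and b within each group; c is left out of .SD
res <- DT[, lapply(.SD, sum), by = grp, .SDcols = c("a", "b")]
print(res)
```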

The NEWS item was just trying to convey that it doesn't matter which of these syntaxes you use, or whether you pass na.rm or not: GForce optimization will still be applied. It's saying that you can mix sum() and mean() in one call (which syntax (2) allows), but as soon as you do something else (like min()), GForce won't kick in, since a grouped version of min hasn't been written yet; only mean and sum have GForce optimizations currently. You can use verbose=TRUE to see if GForce is being applied.
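As a small illustration of mixing the two functions in one call, with and without na.rm, here is a sketch on made-up data (the column and result names are just for the example). The values you get back are the same as ordinary sum() and mean() would give; the optimization only changes how they are computed.

```r
library(data.table)

DT <- data.table(grp = c("x", "x", "y", "y", "y"),
                 a = c(1, NA, 3, 4, 5),
                 b = c(2, 4, NA, 8, 10))

# sum() and mean() mixed in one call, with na.rm passed through
mixed <- DT[, list(sum_a  = sum(a, na.rm = TRUE),
                   mean_b = mean(b, na.rm = TRUE)), by = grp]

# Same call without na.rm: groups containing an NA return NA as usual
plain <- DT[, list(sum_a = sum(a), mean_b = mean(b)), by = grp]

print(mixed)
print(plain)
```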

Details of the machine used for this timing :

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    8
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2494.022
BogoMIPS:              4988.04
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7
