About GForce in data.table 1.9.2


Question


I don't know how to take full advantage of GForce in data.table 1.9.2. The NEWS item says:

New optimization: GForce. Rather than grouping the data, the group locations are passed into grouped versions of sum and mean (gsum and gmean) which then compute the result for all groups in a single sequential pass through the column for cache efficiency. Further, since the g* function is called just once, we don't need to find ways to speed up calling sum or mean repetitively for each group.

When submitting the following code

DT <- data.table(A=c(NA,NA,1:3), B=c("a",NA,letters[1:3]))
DT[,sum(A,na.rm=TRUE),by= B]

I got this

    B V1
1:  a  1
2: NA  0
3:  b  2
4:  c  3

and when trying DT[,sum(A,na.rm=FALSE),by= B], I got

    B  V1
1:  a  NA
2:  NA NA
3:  b  2
4:  c  3

Do these results explain what GForce does? Is it about adding the na.rm=TRUE/FALSE option?

Thanks a lot!

Solution

It has nothing to do with na.rm. What you show worked fine before, too. However, I can see why you might have thought that. Here is the rest of the same NEWS item:

Examples where GForce applies now :
    DT[,sum(x,na.rm=),by=...]                       # yes
    DT[,list(sum(x,na.rm=),mean(y,na.rm=)),by=...]  # yes
    DT[,lapply(.SD,sum,na.rm=),by=...]              # yes
    DT[,list(sum(x),min(y)),by=...]                 # no. gmin not yet available
GForce is a level 2 optimization. To turn it off: options(datatable.optimize=1)
Reminder: to see the optimizations and other info, set verbose=TRUE

You don't need to do anything to benefit, it's an automatic optimization.
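As for the NA you saw with na.rm=FALSE: that is just base R's sum() semantics, which GForce's gsum reproduces exactly. In your data, group "a" contains A = c(NA, 1), so:

```r
# Base R sum() semantics; GForce's gsum matches these exactly.
sum(c(NA, 1), na.rm = TRUE)   # drops the NA, returns 1
sum(c(NA, 1), na.rm = FALSE)  # any NA makes the whole sum NA
```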

Here's an example on 500 million rows and 4 columns (13GB). First create and illustrate the data:

$ R
R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

> require(data.table)
Loading required package: data.table
data.table 1.9.2  For help type: help("data.table")

> DT = data.table( grp = sample(1e6,5e8,replace=TRUE), 
                   a = rnorm(1e6),
                   b = rnorm(1e6),
                   c = rnorm(1e6))
> tables()
     NAME        NROW    MB COLS      KEY
[1,] DT   500,000,000 13352 grp,a,b,c    
Total: 13,352MB
> print(DT)
          grp          a            b          c
1e+00: 695059 -1.4055192  1.587540028  1.7104991
2e+00: 915263 -0.8239298 -0.513575696 -0.3429516
3e+00: 139937 -0.2202024  0.971816721  1.0597421
4e+00: 651525  1.0026858 -1.157824780  0.3100616
5e+00: 438180  1.1074729 -2.513939427  0.8357155
   ---                                          
5e+08: 705823 -1.4773420  0.004369457 -0.2867529
5e+08: 716694 -0.6826147 -0.357086020 -0.4044164
5e+08: 217509  0.4939808 -0.012797093 -1.1084564
5e+08: 501760  1.7081212 -1.772721799 -0.7119432
5e+08: 765653 -1.1141456 -1.569578263  0.4947304

Now time with GForce optimization on (the default). Notice there is no setkey first. This is what's known as cold by or ad hoc by, which is common practice when you want to group in lots of different ways.

> system.time(ans1 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed 
 47.520   5.651  53.173 
> system.time(ans1 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed 
 47.372   5.676  53.049      # immediate repeat to confirm timing

Now turn off GForce optimization (as per the NEWS item) to see the difference it makes:

> options(datatable.optimize=1)

> system.time(ans2 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed 
 97.274   3.383 100.659 
> system.time(ans2 <- DT[, lapply(.SD,sum), by=grp])
   user  system elapsed 
 97.199   3.423 100.624      # immediate repeat to confirm timing

Finally, confirm the results are the same:

> identical(ans1,ans2)
[1] TRUE
> print(ans1)
            grp          a          b          c
      1: 695059  16.791281  13.269647 -10.663118
      2: 915263  43.312584 -33.587933   4.490842
      3: 139937   3.967393 -10.386636  -3.766019
      4: 651525  -4.152362   9.339594   7.740136
      5: 438180   4.725874  26.328877   9.063309
     ---                                        
 999996: 372601  -2.087248 -19.936420  21.172860
 999997:  13912  18.414226  -1.744378  -7.951381
 999998: 150074  -4.031619   8.433173 -22.041731
 999999: 385718  11.527876   6.807802   7.405016
1000000: 906246 -13.857315 -23.702011   6.605254

Notice that data.table retains the order of the groups according to when they first appeared. To order the grouped result, use keyby= instead of by=.
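The first-appearance ordering versus keyby= can be sketched on toy data (not taken from the timing above):

```r
library(data.table)  # assumes data.table is installed

DT <- data.table(g = c("b", "a", "b", "a"), x = 1:4)

# by= keeps groups in the order they first appear: "b" then "a"
DT[, sum(x), by = g]     # g = b (x: 1+3 = 4), then a (x: 2+4 = 6)

# keyby= sorts the result by the grouping column and sets it as the key
DT[, sum(x), keyby = g]  # g = a (6), then b (4)
```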

To turn GForce optimization back on (default is Inf to benefit from all optimizations) :

> options(datatable.optimize=Inf)

Aside: if you're not familiar with the lapply(.SD,...) syntax, it's just a way to apply a function through columns by group. For example, these two lines are equivalent:

 DT[, lapply(.SD,sum), by=grp]               # (1)
 DT[, list(sum(a),sum(b),sum(c)), by=grp]    # (2) exactly the same

The first (1) is more useful as you have more columns, especially in combination with .SDcols to control which subset of the columns to apply the function through.
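For instance, restricting the summed columns to a and b only (toy data for illustration):

```r
library(data.table)  # assumes data.table is installed

DT <- data.table(grp = c(1, 1, 2), a = 1:3, b = 4:6, c = letters[1:3])

# Sum only columns a and b per group; the character column c is left out
DT[, lapply(.SD, sum), by = grp, .SDcols = c("a", "b")]
# grp 1: a = 1+2 = 3, b = 4+5 = 9;  grp 2: a = 3, b = 6
```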

The NEWS item was just trying to convey that it doesn't matter which of these syntaxes is used, or whether you pass na.rm or not: GForce optimization will still be applied. It's saying that you can mix sum() and mean() in one call (which syntax (2) allows), but as soon as you do something else (like min()), GForce won't kick in, since gmin isn't done yet; only mean and sum have GForce optimizations currently. You can use verbose=TRUE to see whether GForce is being applied.
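To check on your own calls, a small sketch using the data from the question (the exact verbose messages vary across data.table versions, so none are reproduced here):

```r
library(data.table)  # assumes data.table is installed

DT <- data.table(A = c(NA, NA, 1:3), B = c("a", NA, letters[1:3]))

# verbose=TRUE prints grouping/optimization info for this call,
# including whether the grouped sum was optimized (GForce/gsum)
ans <- DT[, sum(A, na.rm = TRUE), by = B, verbose = TRUE]
```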

Details of the machine used for this timing:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    8
Core(s) per socket:    1
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2494.022
BogoMIPS:              4988.04
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7
