Time in getting single elements from data.table and data.frame objects


Problem Description



In my work I use several tables (customer details, transaction records, etc). Since some of them are very big (millions of rows), I've recently switched to the data.table package (thanks Matthew). However, some of them are quite small (a few hundred rows and 4/5 columns) and are accessed several times. Therefore I started to think about the overhead of [.data.table in retrieving data, rather than set()ting values, which is already clearly described in ?set: regardless of the size of the table, one item is set in around 2 microseconds (depending on the CPU).

However, there doesn't seem to be an equivalent of set for getting a value from a data.table when the exact row and column are known. A sort of loopable [.data.table.

library(data.table)
library(microbenchmark)

m = matrix(1,nrow=100000,ncol=100)
DF = as.data.frame(m)
DT = as.data.table(m)  # same data used in ?set

> microbenchmark(DF[3450, 1], DT[3450, V1], times=1000)  # much more overhead in DT
Unit: microseconds
         expr     min      lq   median      uq      max neval
  DF[3450, 1]  32.745  36.166  40.5645  43.497  193.533  1000
 DT[3450, V1] 788.791 803.453 813.2270 832.287 5826.982  1000

> microbenchmark(DF$V1[3450], DT[3450, 1, with=F], times=1000)  # using atomic vector and
                                                                # removing part of DT overhead
Unit: microseconds
                  expr     min      lq  median      uq      max neval
           DF$V1[3450]   2.933   3.910   5.865   6.354   36.166  1000
 DT[3450, 1, with = F] 297.629 303.494 305.938 309.359 1878.632  1000

> microbenchmark(DF$V1[3450], DT$V1[3450], times=1000) # using only atomic vectors
Unit: microseconds
        expr   min    lq median    uq    max neval
 DF$V1[3450] 2.933 2.933  3.421 3.422 40.565  1000    # DF seems still a bit faster (23%)
 DT$V1[3450] 3.910 3.911  4.399 4.399 16.128  1000

The last method is indeed the fastest one for retrieving a single element several times. However, set is even faster:

> microbenchmark(set(DT,1L,1L,5L), times=1000)
Unit: microseconds
                expr   min    lq median    uq    max neval
 set(DT, 1L, 1L, 5L) 1.955 1.956  2.444 2.444 24.926  1000
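
To make the "accessed several times" scenario concrete, a minimal sketch (the loop count here is chosen arbitrarily for illustration) shows how the per-call overhead compounds in a loop:

n <- 10000
system.time(for (i in seq_len(n)) x <- DT[3450, V1])  # pays the full [.data.table overhead on every call
system.time(for (i in seq_len(n)) x <- DT$V1[3450])   # plain atomic-vector access, far cheaper per call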

The question is: if we can set a value in 2.444 microseconds, shouldn't it be possible to get a value in a smaller (or at least similar) amount of time? Thanks.

EDIT: adding two more options as suggested:

> microbenchmark(`[.data.frame`(DT,3450,1), DT[["V1"]][3450], times=1000)
Unit: microseconds
                        expr    min     lq median     uq      max neval
 `[.data.frame`(DT, 3450, 1) 46.428 47.895 48.383 48.872 2165.509  1000
            DT[["V1"]][3450] 20.038 21.504 23.459 24.437  116.316  1000

which unfortunately are not faster than the previous attempts.

Solution

Thanks to @hadley we have the solution! .subset2 is essentially [[ without S3 method dispatch, which is where the remaining overhead lies:

> microbenchmark(DT$V1[3450], set(DT,1L,1L,5L), .subset2(DT, "V1")[3450], times=1000, unit="us")
Unit: microseconds
                     expr   min    lq median    uq    max neval
              DT$V1[3450] 2.566 3.208  3.208 3.528 27.582  1000
      set(DT, 1L, 1L, 5L) 1.604 1.925  1.925 2.246 15.074  1000
 .subset2(DT, "V1")[3450] 0.000 0.321  0.322 0.642  8.339  1000
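
As a usage sketch, the dispatch-free read can be paired with set() for a fast read/modify/write loop (the loop bounds and the update rule below are purely illustrative):

# Illustrative sketch: a "loopable" get/set pair without method dispatch
for (i in 1:100) {
  v <- .subset2(DT, "V1")[i]  # like DT[["V1"]][i] but skips S3 dispatch
  set(DT, i, 1L, v + 1)       # update row i, column 1 by reference (see ?set)
}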
