提高data.table日期+时间粘贴的性能? [英] Improve performance of data.table date+time pasting?

查看:121
本文介绍了提高data.table日期+时间粘贴的性能?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我不知道我可以在这里提出这个问题,让我知道,如果我应该在别的地方做。

I am not sure that I can ask this question here, let me know if I should do it somewhere else.

我有一个data.table 1e6行具有以下结构:

I have a data.table with 1e6 rows having this structure:

        V1       V2     V3
1: 03/09/2011 08:05:40 1145.0
2: 03/09/2011 08:06:01 1207.3
3: 03/09/2011 08:06:17 1198.8
4: 03/09/2011 08:06:20 1158.4
5: 03/09/2011 08:06:40 1112.2
6: 03/09/2011 08:06:59 1199.3

我使用以下代码将V1和V2变量转换为唯一的datetime变量:

I am converting the V1 and V2 variables to a unique datetime variable, using this code:

 system.time(DT[,`:=`(index= as.POSIXct(paste(V1,V2),
                         format='%d/%m/%Y %H:%M:%S'),
                     V1=NULL,V2=NULL)])

   user  system elapsed 
   47.47    0.16   50.27 

有没有任何方法来提高这种转换的性能?

Is there any method to improve performance of this transformation?

此处 dput(head(DT))

DT <- structure(list(V1 = c("03/09/2011", "03/09/2011", "03/09/2011", 
"03/09/2011", "03/09/2011", "03/09/2011"), V2 = c("08:05:40", 
"08:06:01", "08:06:17", "08:06:20", "08:06:40", "08:06:59"), 
    V3 = c(1145, 1207.3, 1198.8, 1158.4, 1112.2, 1199.3)), .Names = c("V1", 
"V2", "V3"), class = c("data.table", "data.frame"), row.names = c(NA, 
-6L), .internal.selfref = <pointer: 0x00000000002a0788>)


推荐答案

这种方法看起来运行速度比OP快40倍,使用查找表并利用极快的数据表连接。此外,它利用了这样的事实,即虽然可能存在1e6个日期和时间的组合,但是可以存在至多86400个独特时间,并且可能甚至更少的日期。最后,它避免使用粘贴(...)

This approach, which appears to run ~40X faster than OP's, uses lookup tables and takes advantage of the extremely fast data table joins. Also, it takes advantage of the fact that, while there may be 1e6 combinations of date and time, there can be at most 86400 unique times, and probably even fewer dates. Finally, it avoids the use of paste(...) altogether.

library(data.table)
library(stringr)

# create a dataset with 1MM rows
set.seed(1)
x  <- 1000*sample(1:1e5,1e6,replace=T)
dt <- data.table(id=1:1e6,
                 V1=format(as.POSIXct(x,origin="2011-01-01"),"%d/%m/%Y"),
                 V2=format(as.POSIXct(x,origin="2011-01-01"),"%H:%M:%S"),
                 V3=x)
DT <- dt

index.date <- function(dt) {
  # Edit: this change processes only times from the dataset; slightly more efficient
  V2 <- unique(dt$V2)
  dt.time <- data.table(char.time=V2,
                        int.time=as.integer(substr(V2,7,8))+
                          60*(as.integer(substr(V2,4,5))+
                                60*as.integer(substr(V2,1,2))))
  setkey(dt.time,char.time)
  # all dates from dataset
  dt.date <- data.table(char.date=unique(dt$V1), int.date=as.integer(as.POSIXct(unique(dt$V1),format="%d/%m/%Y")))
  setkey(dt.date,char.date)
  # join the dates
  setkey(dt,V1)
  dt <- dt[dt.date]
  # join the times
  setkey(dt,V2)
  dt <- dt[dt.time, nomatch=0]
  # numerical index
  dt[,int.index:=int.date+int.time]
  # POSIX date index
  dt[,index:=as.POSIXct(int.index,origin='1970-01-01')]
  # get back original order
  setkey(dt,id)
  return(dt)
}
# new approach
system.time(dt<-index.date(dt))
#   user  system elapsed 
#   2.26    0.00    2.26 


# original approach
DT <- dt
system.time(DT[,`:=`(index= as.POSIXct(paste(V1,V2),
                                       format='%d/%m/%Y %H:%M:%S'),
                     V1=NULL,V2=NULL)])
#   user  system elapsed 
#  84.33    0.06   84.52 

请注意,性能取决于多少唯一日期有。在测试用例中有大约1200个唯一的日期。

Note that performance does depend on how many unique dates there are. In the test case there were ~1200 unique dates.

EDIT 命令将函数写入更多的data.table-sugar语法, $用于子集:

EDIT proposition to write the function in more data.table-sugar syntax and avoid "$" for subsetting:

index.date <- function(dt,fmt="%d/%m/%Y") {
    dt.time <- data.table(char.time = dt[,unique(V2)],key='char.time')
    dt.time[,int.time :=as.integer(substr(char.time,7,8))+
                                            60*(as.integer(substr(char.time,4,5))+
                                                        60*as.integer(substr(char.time,1,2)))]
    # all dates from dataset
    dt.date <- data.table(char.date = dt[,unique(V1)],key='char.date')
    dt.date[,int.date:=as.integer(as.POSIXct(char.date,format=fmt))]
    # join the dates
    setkey(dt,V1)
    dt <- dt[dt.date]
    # join the times
    setkey(dt,V2)
    dt <- dt[dt.time, nomatch=0]
    # numerical index
    dt[,int.index:=int.date+int.time]
    # POSIX date index
    dt[,index:=as.POSIXct.numeric(int.index,origin='1970-01-01')]
    # remove extra/temporary variables
    dt[,`:=`(int.index=NULL,int.date=NULL,int.time=NULL)]
}

这篇关于提高data.table日期+时间粘贴的性能?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆