Why is as.Date slow on a character vector?


Problem description


I started using the data.table package in R to boost the performance of my code. I am using the following code:

sp500 <- read.csv('../rawdata/GMTSP.csv')
days <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")

# Using data.table to get the things much much faster
sp500 <- data.table(sp500, key="Date")
sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")]
sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)]
sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)]
sp500 <- sp500[,Month:=(as.POSIXlt(Date)$mon+1)]

I noticed that the conversion done by the as.Date function is very slow compared to the other functions that create weekdays, etc. Why is that? Is there a better/faster way to convert to Date format? (If you ask whether I really need the Date format: probably yes, because I then use ggplot2 to make plots, which works like a charm with this type of data.)

To be more precise:

> system.time(sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")])
   user  system elapsed 
 92.603   0.289  93.014 
> system.time(sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)])
   user  system elapsed 
  1.938   0.062   2.001 
> system.time(sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)])
   user  system elapsed 
  0.304   0.001   0.305 

On a MacAir i5 with slightly fewer than 3,000,000 observations.

Thanks

Solution

I think it's just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.

To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.

> as.Date
function (x, ...) 
UseMethod("as.Date")
<bytecode: 0x2cf4b20>
<environment: namespace:base>

> methods(as.Date)
[1] as.Date.character as.Date.date      as.Date.dates     as.Date.default  
[5] as.Date.factor    as.Date.IDate*    as.Date.numeric   as.Date.POSIXct  
[9] as.Date.POSIXlt  
   Non-visible functions are asterisked

> as.Date.character
function (x, format = "", ...) 
{
    charToDate <- function(x) {
        xx <- x[1L]
        if (is.na(xx)) {
            j <- 1L
            while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
            if (is.na(xx)) 
                f <- "%Y-%m-%d"
        }
        if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d", 
            tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d", 
            tz = "GMT"))) 
            return(strptime(x, f))
        stop("character string is not in a standard unambiguous format")
    }
    res <- if (missing(format)) 
        charToDate(x)
    else strptime(x, format, tz = "GMT")       ####  slow part, I think  ####
    as.Date(res)
}
<bytecode: 0x2cf6da0>
<environment: namespace:base>
> 
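A quick way to check this reading of the source is to run both routes on a small vector and compare. This is only a sketch: the sample strings are made up for illustration, and it assumes as.Date.character still delegates to strptime in your version of R.

```r
# Hypothetical sample input in the question's month/day/year format
x   <- c("01/02/2012", "12/31/2011", NA)
fmt <- "%m/%d/%Y"

# Route 1: the character method of as.Date
via_as_date <- as.Date(x, fmt)

# Route 2: what the method does internally, spelled out by hand
via_strptime <- as.Date(strptime(x, fmt, tz = "GMT"))

# Both routes should yield the same Date values (NA stays NA)
print(via_as_date)
print(via_strptime)
```

If the two printed vectors agree, the cost of as.Date on character input is essentially the cost of strptime plus a cheap POSIXlt-to-Date step.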

Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through:

> as.POSIXct
function (x, tz = "", ...) 
UseMethod("as.POSIXct")
<bytecode: 0x2936de8>
<environment: namespace:base>

> methods(as.POSIXct)
[1] as.POSIXct.date    as.POSIXct.Date    as.POSIXct.dates   as.POSIXct.default
[5] as.POSIXct.IDate*  as.POSIXct.ITime*  as.POSIXct.numeric as.POSIXct.POSIXlt
   Non-visible functions are asterisked

> as.POSIXlt.Date
function (x, ...) 
{
    y <- .Internal(Date2POSIXlt(x))
    names(y$year) <- names(x)
    y
}
<bytecode: 0x395e328>
<environment: namespace:base>
> 
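Since going from Date to POSIXlt is a single internal call, the year and month columns from the question can share one conversion instead of converting twice. A minimal sketch with made-up dates (the variable names are illustrative, not from the question):

```r
# Two example dates; a Date is stored as a number of days since 1970-01-01
d <- as.Date(c("2011-03-15", "2012-12-31"))

# Convert once and reuse the broken-down fields
lt    <- as.POSIXlt(d)
year  <- lt$year + 1900L  # POSIXlt years are counted from 1900
month <- lt$mon + 1L      # POSIXlt months are 0-based

print(data.frame(d, year, month))
```

No string parsing is involved here, which is consistent with the sub-second timing reported in the question.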

Intrigued, let's dig into Date2POSIXlt. For this bit we need to grep src/main to find which .c file to look at.

~/R/Rtrunk/src/main$ grep Date2POSIXlt *
names.c:{"Date2POSIXlt",do_D2POSIXlt,   0,  11, 1,  {PP_FUNCALL, PREC_FN,   0}},
$

Now we know we need to look for D2POSIXlt:

~/R/Rtrunk/src/main$ grep D2POSIXlt *
datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
names.c:{"Date2POSIXlt",do_D2POSIXlt,   0,  11, 1,  {PP_FUNCALL, PREC_FN,   0}},
$

Oh, we could have guessed datetime.c. Anyway, look at the latest live copy of datetime.c.

Search in there for D2POSIXlt and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see that POSIXlt is one real vector (8 bytes per element) plus eight integer vectors (4 bytes per element each). That's 40 bytes per date!
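The layout can also be inspected from R without reading the C source. The sketch below is illustrative only; the exact set of fields depends on the R version (newer versions may carry extra fields such as zone and gmtoff), so no particular sizes are claimed.

```r
# 1000 consecutive dates as a plain Date vector (one double per date)
d  <- as.Date("2012-01-01") + 0:999
lt <- as.POSIXlt(d)

# A POSIXlt object is a list of parallel vectors, one per field
fields <- unclass(lt)
print(vapply(fields, typeof, character(1)))

# Compare the memory footprint of the two representations
print(object.size(d))
print(object.size(lt))
```

The POSIXlt version should come out several times larger than the Date vector, which is one reason to keep POSIXlt objects short-lived rather than storing them as columns.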

So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.


Here's a reproducible example using the number of items stated in the question (3,000,000):

> Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
> Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
> system.time(as.Date(Date, "%m/%d/%Y"))
   user  system elapsed 
 21.681   0.060  21.760 
> system.time(strptime(Date, "%m/%d/%Y"))
   user  system elapsed 
 29.594   8.633  38.270 
> system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
   user  system elapsed 
 19.785   0.000  19.802 

Passing tz appears to speed up strptime, and that is what as.Date.character does. So maybe the timing depends on your locale. But strptime appears to be the culprit, not data.table. Perhaps rerun this example and see whether it takes 90 seconds on your machine?
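If strptime is the bottleneck, one possible workaround (not part of the original answer) is to parse each distinct string only once and map the results back. This helps only when the column has many repeated dates, as the sampled data above does. The helper name as_date_unique is hypothetical, and the vector is kept smaller than 3,000,000 so the sketch runs quickly.

```r
# Hypothetical helper, not part of base R or data.table:
# parse each distinct string once, then expand back to full length.
as_date_unique <- function(x, format) {
  ux <- unique(x)                     # distinct strings only
  ud <- as.Date(ux, format = format)  # strptime runs on these few values
  ud[match(x, ux)]                    # look up each original element
}

set.seed(1)
Range <- seq(as.Date("2000-01-01"), as.Date("2012-01-01"), by = "days")
Date  <- format(sample(Range, 100000, replace = TRUE), "%m/%d/%Y")

# Time both approaches; absolute numbers will vary by machine and locale
print(system.time(slow <- as.Date(Date, "%m/%d/%Y")))
print(system.time(fast <- as_date_unique(Date, "%m/%d/%Y")))
```

The design choice is simply to trade a hash lookup (match) for repeated parsing; with nearly all-unique strings the helper would offer no benefit.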
