Why is as.Date slow on a character vector?
Problem description
I started using the data.table package in R to boost the performance of my code, and I noticed that the conversion done by the as.Date function is very slow compared with the other functions that create weekdays, etc. Why is that? Is there a better or faster way to convert to Date format? (If you ask whether I really need the Date format: probably yes, because I then use ggplot2 to make plots, which works like a charm with this type of data.) The timings further down are from a MacBook Air (i5) with slightly fewer than 3,000,000 observations. I am using the following code:
library(data.table)
sp500 <- read.csv('../rawdata/GMTSP.csv')
days <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
# Using data.table to get things much faster
sp500 <- data.table(sp500, key="Date")
sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")]
sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)]
sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)]
sp500 <- sp500[,Month:=(as.POSIXlt(Date)$mon+1)]
To be more precise:
> system.time(sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")])
   user  system elapsed
 92.603   0.289  93.014
> system.time(sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)])
   user  system elapsed
  1.938   0.062   2.001
> system.time(sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)])
   user  system elapsed
  0.304   0.001   0.305
Recommended answer
I think it's just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.
To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.
> as.Date
function (x, ...)
UseMethod("as.Date")
<bytecode: 0x2cf4b20>
<environment: namespace:base>
> methods(as.Date)
[1] as.Date.character as.Date.date as.Date.dates as.Date.default
[5] as.Date.factor as.Date.IDate* as.Date.numeric as.Date.POSIXct
[9] as.Date.POSIXlt
Non-visible functions are asterisked
> as.Date.character
function (x, format = "", ...)
{
    charToDate <- function(x) {
        xx <- x[1L]
        if (is.na(xx)) {
            j <- 1L
            while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
            if (is.na(xx))
                f <- "%Y-%m-%d"
        }
        if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d",
            tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d",
            tz = "GMT")))
            return(strptime(x, f))
        stop("character string is not in a standard unambiguous format")
    }
    res <- if (missing(format))
        charToDate(x)
    else strptime(x, format, tz = "GMT") #### slow part, I think ####
    as.Date(res)
}
<bytecode: 0x2cf6da0>
<environment: namespace:base>
>
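As a quick check of the reading above, the sketch below compares as.Date with an explicit format against calling strptime with tz = "GMT" and converting the result by hand. The sample strings are made up for illustration; both routes should produce the same dates.

```r
# Made-up sample input in the same %m/%d/%Y layout as the question.
x <- c("01/31/2012", "12/25/2011", "02/29/2008")

# Route 1: the S3 method dispatched for character input.
via_method <- as.Date(x, "%m/%d/%Y")

# Route 2: what the method body shown above does, written out by hand.
via_strptime <- as.Date(strptime(x, "%m/%d/%Y", tz = "GMT"))

# TRUE when both routes produce the same dates.
print(all(via_method == via_strptime))
```

If the two agree, any time spent in as.Date on character input is essentially time spent in strptime plus a cheap POSIXlt-to-Date step.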
Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through:
> as.POSIXct
function (x, tz = "", ...)
UseMethod("as.POSIXct")
<bytecode: 0x2936de8>
<environment: namespace:base>
> methods(as.POSIXct)
[1] as.POSIXct.date as.POSIXct.Date as.POSIXct.dates as.POSIXct.default
[5] as.POSIXct.IDate* as.POSIXct.ITime* as.POSIXct.numeric as.POSIXct.POSIXlt
Non-visible functions are asterisked
> as.POSIXlt.Date
function (x, ...)
{
    y <- .Internal(Date2POSIXlt(x))
    names(y$year) <- names(x)
    y
}
<bytecode: 0x395e328>
<environment: namespace:base>
>
Intrigued, let's dig into Date2POSIXlt. For this bit we need to grep main/src to know which .c file to look at.
~/R/Rtrunk/src/main$ grep Date2POSIXlt *
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$
Now we know we need to look for D2POSIXlt:
~/R/Rtrunk/src/main$ grep D2POSIXlt *
datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$
Oh, we could have guessed datetime.c. Anyway, looking at the latest live copy of that file, search in there for D2POSIXlt and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see that POSIXlt is one real vector (8 bytes) plus eight integer vectors (4 bytes each). That's 40 bytes per date!
So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.
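To make the storage point concrete, here is a small sketch (not part of the original answer) that inspects both representations. The number of POSIXlt components can differ between R versions, so the code prints what it finds rather than assuming a count.

```r
d <- as.Date(c("2000-01-01", "2012-01-01"))

# A Date is just a number of days since 1970-01-01 plus a class attribute.
print(unclass(d))

# A POSIXlt is a list of vectors, with one element per date in each component.
lt <- as.POSIXlt(d)
print(sapply(unclass(lt), typeof))

# Year and month come straight out of the integer components,
# which is why the Year and Month columns in the question are cheap to build.
years  <- lt$year + 1900
months <- lt$mon + 1
print(data.frame(date = d, year = years, month = months))
```

Going from Date to POSIXlt is plain arithmetic on the day count, whereas going from character to POSIXlt means parsing every string.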
Here's a reproducible example using the number of items stated in the question (3,000,000):
> Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
> Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
> system.time(as.Date(Date, "%m/%d/%Y"))
   user  system elapsed
 21.681   0.060  21.760
> system.time(strptime(Date, "%m/%d/%Y"))
   user  system elapsed
 29.594   8.633  38.270
> system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
   user  system elapsed
 19.785   0.000  19.802
Passing tz appears to speed up strptime, which as.Date.character does (it calls strptime with tz = "GMT"). So maybe it depends on your locale. But strptime appears to be the culprit, not data.table. Perhaps rerun this example and see if it takes 90 seconds for you on your machine?
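One way to avoid most of the strptime cost, offered here as a sketch rather than as part of the original answer, is to parse each distinct string only once and map the results back with match(). The helper name fast_as_date is hypothetical. This only helps when the column repeats values heavily, as in the example above, where 3,000,000 rows are drawn from a few thousand distinct days.

```r
# Hypothetical helper: parse each distinct string once, then expand.
fast_as_date <- function(x, format) {
  u <- unique(x)                 # distinct strings only
  parsed <- as.Date(u, format)   # strptime runs on length(u) items
  parsed[match(x, u)]            # map back to the original order
}

# Smaller version of the reproducible example above.
Range <- seq(as.Date("2000-01-01"), as.Date("2012-01-01"), by = "days")
Date  <- format(sample(Range, 100000, replace = TRUE), "%m/%d/%Y")

slow <- as.Date(Date, "%m/%d/%Y")
fast <- fast_as_date(Date, "%m/%d/%Y")
print(all(slow == fast))

# Compare timings on your own machine; results will vary.
print(system.time(as.Date(Date, "%m/%d/%Y")))
print(system.time(fast_as_date(Date, "%m/%d/%Y")))
```

Inside the data.table from the question this could be used as sp500[, Date := fast_as_date(Date, "%m/%d/%Y")]. If nearly every value is distinct, the unique() and match() steps add overhead without saving any parsing.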