为什么 as.Date 在字符向量上很慢? [英] Why is as.Date slow on a character vector?
问题描述
我开始在 R 中使用 data.table 包来提高代码的性能.我正在使用以下代码:
I started using data.table package in R to boost performance of my code. I am using the following code:
sp500 <- read.csv('../rawdata/GMTSP.csv')
days <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
# Using data.table to get the things much much faster
sp500 <- data.table(sp500, key="Date")
sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")]
sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)]
sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)]
sp500 <- sp500[,Month:=(as.POSIXlt(Date)$mon+1)]
我注意到 as.Date 函数完成的转换与创建工作日等的其他函数相比非常慢.这是为什么呢?有没有更好/更快的解决方案,如何转换为日期格式?(如果你问我是否真的需要日期格式,可能是的,因为然后使用 ggplot2 来制作绘图,这对这种类型的数据来说就像一个魅力.)
I noticed that the conversion done by as.Date function is very slow, when compared to other functions that create weekdays, etc. Why is that? Is there a better/faster solution, how to convert into date-format? (If you would ask whether I really need the date format, probably yes, because then use ggplot2 to make plots, which work like a charm with this type of data.)
更准确地说
> system.time(sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")])
user system elapsed
92.603 0.289 93.014
> system.time(sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)])
user system elapsed
1.938 0.062 2.001
> system.time(sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)])
user system elapsed
0.304 0.001 0.305
在 MacAir i5 上,观测值略低于 3000000.
On MacAir i5 with slightly less then 3000000 observations.
推荐答案
我觉得只是as.Date
把character
转成Date
通过 POSIXlt
,使用 strptime
.我相信 strptime
非常慢.
I think it's just that as.Date
converts character
to Date
via POSIXlt
, using strptime
. And strptime
is very slow, I believe.
要自己跟踪它,输入 as.Date
,然后输入 methods(as.Date)
,然后查看 character
方法.
To trace it through yourself, type as.Date
, then methods(as.Date)
, then look at the character
method.
> as.Date
function (x, ...)
UseMethod("as.Date")
<bytecode: 0x2cf4b20>
<environment: namespace:base>
> methods(as.Date)
[1] as.Date.character as.Date.date as.Date.dates as.Date.default
[5] as.Date.factor as.Date.IDate* as.Date.numeric as.Date.POSIXct
[9] as.Date.POSIXlt
Non-visible functions are asterisked
> as.Date.character
function (x, format = "", ...)
{
charToDate <- function(x) {
xx <- x[1L]
if (is.na(xx)) {
j <- 1L
while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
if (is.na(xx))
f <- "%Y-%m-%d"
}
if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d",
tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d",
tz = "GMT")))
return(strptime(x, f))
stop("character string is not in a standard unambiguous format")
}
res <- if (missing(format))
charToDate(x)
else strptime(x, format, tz = "GMT") #### slow part, I think ####
as.Date(res)
}
<bytecode: 0x2cf6da0>
<environment: namespace:base>
>
为什么 as.POSIXlt(Date)$year+1900
比较快?再次,追踪它:
Why is as.POSIXlt(Date)$year+1900
relatively fast? Again, trace it through :
> as.POSIXct
function (x, tz = "", ...)
UseMethod("as.POSIXct")
<bytecode: 0x2936de8>
<environment: namespace:base>
> methods(as.POSIXct)
[1] as.POSIXct.date as.POSIXct.Date as.POSIXct.dates as.POSIXct.default
[5] as.POSIXct.IDate* as.POSIXct.ITime* as.POSIXct.numeric as.POSIXct.POSIXlt
Non-visible functions are asterisked
> as.POSIXlt.Date
function (x, ...)
{
y <- .Internal(Date2POSIXlt(x))
names(y$year) <- names(x)
y
}
<bytecode: 0x395e328>
<environment: namespace:base>
>
很感兴趣,让我们深入研究 Date2POSIXlt.对于这一点,我们需要 grep main/src 来知道要查看哪个 .c 文件.
Intrigued, let's dig into Date2POSIXlt. For this bit we need to grep main/src to know which .c file to look at.
~/R/Rtrunk/src/main$ grep Date2POSIXlt *
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$
现在我们知道我们需要寻找 D2POSIXlt :
Now we know we need to look for D2POSIXlt :
~/R/Rtrunk/src/main$ grep D2POSIXlt *
datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
names.c:{"Date2POSIXlt",do_D2POSIXlt, 0, 11, 1, {PP_FUNCALL, PREC_FN, 0}},
$
哦,我们可以猜到 datetime.c.无论如何,看看最新的现场副本:
Oh, we could have guessed datetime.c. Anyway, so looking at latest live copy :
在那里搜索 D2POSIXlt
,您会发现从 Date(数字)到 POSIXlt 是多么简单.您还将看到 POSIXlt 如何是一个实数向量(8 个字节)加上七个整数向量(每个 4 个字节).每个日期 40 个字节!
Search in there for D2POSIXlt
and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see how POSIXlt is one real vector (8 bytes) plus seven integer vectors (4 bytes each). That's 40 bytes, per date!
所以问题的症结(我认为)是为什么 strptime
这么慢,也许可以在 R 中改进.或者直接避免 POSIXlt
或间接.
So the crux of the issue (I think) is why strptime
is so slow, and maybe that can be improved in R. Or just avoid POSIXlt
, either directly or indirectly.
这是一个使用相关项目数 (3,000,000) 的可重现示例:
Here's a reproducible example using the number of items stated in question (3,000,000) :
> Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
> Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
> system.time(as.Date(Date, "%m/%d/%Y"))
user system elapsed
21.681 0.060 21.760
> system.time(strptime(Date, "%m/%d/%Y"))
user system elapsed
29.594 8.633 38.270
> system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
user system elapsed
19.785 0.000 19.802
传递 tz
似乎可以加快 strptime
,而 as.Date.character
确实如此.所以也许这取决于你的语言环境.但是 strptime
似乎是罪魁祸首,而不是 data.table
.或许重新运行这个例子,看看你的机器上是否需要 90 秒?
Passing tz
appears to speed up strptime
, which as.Date.character
does. So maybe it depends on your locale. But strptime
appears to be the culprit, not data.table
. Perhaps rerun this example and see if it takes 90 seconds for you on your machine?
这篇关于为什么 as.Date 在字符向量上很慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!