为什么 as.Date 在字符向量上很慢? [英] Why is as.Date slow on a character vector?

查看:22
本文介绍了为什么 as.Date 在字符向量上很慢?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我开始在 R 中使用 data.table 包来提高代码的性能.我正在使用以下代码:

I started using data.table package in R to boost performance of my code. I am using the following code:

sp500 <- read.csv('../rawdata/GMTSP.csv')
days <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")

# Using data.table to get the things much much faster
sp500 <- data.table(sp500, key="Date")
sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")]
sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)]
sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)]
sp500 <- sp500[,Month:=(as.POSIXlt(Date)$mon+1)]

我注意到 as.Date 函数完成的转换与创建工作日等的其他函数相比非常慢.这是为什么呢?有没有更好/更快的解决方案,如何转换为日期格式?(如果你问我是否真的需要日期格式,可能是的,因为然后使用 ggplot2 来制作绘图,这对这种类型的数据来说就像一个魅力.)

I noticed that the conversion done by as.Date function is very slow, when compared to other functions that create weekdays, etc. Why is that? Is there a better/faster solution, how to convert into date-format? (If you would ask whether I really need the date format, probably yes, because then use ggplot2 to make plots, which work like a charm with this type of data.)

更准确地说

> system.time(sp500 <- sp500[,Date:=as.Date(Date, "%m/%d/%Y")])
   user  system elapsed 
 92.603   0.289  93.014 
> system.time(sp500 <- sp500[,Weekday:=factor(weekdays(sp500[,Date]), levels=days, ordered=T)])
   user  system elapsed 
  1.938   0.062   2.001 
> system.time(sp500 <- sp500[,Year:=(as.POSIXlt(Date)$year+1900)])
   user  system elapsed 
  0.304   0.001   0.305 

在 MacAir i5 上,观测值略低于 3000000.

On MacAir i5 with slightly less then 3000000 observations.

推荐答案

我觉得只是as.Datecharacter转成Date通过 POSIXlt,使用 strptime.我相信 strptime 非常慢.

I think it's just that as.Date converts character to Date via POSIXlt, using strptime. And strptime is very slow, I believe.

要自己跟踪它,输入 as.Date,然后输入 methods(as.Date),然后查看 character 方法.

To trace it through yourself, type as.Date, then methods(as.Date), then look at the character method.

> as.Date
function (x, ...) 
UseMethod("as.Date")
<bytecode: 0x2cf4b20>
<environment: namespace:base>

> methods(as.Date)
[1] as.Date.character as.Date.date      as.Date.dates     as.Date.default  
[5] as.Date.factor    as.Date.IDate*    as.Date.numeric   as.Date.POSIXct  
[9] as.Date.POSIXlt  
   Non-visible functions are asterisked

> as.Date.character
function (x, format = "", ...) 
{
    charToDate <- function(x) {
        xx <- x[1L]
        if (is.na(xx)) {
            j <- 1L
            while (is.na(xx) && (j <- j + 1L) <= length(x)) xx <- x[j]
            if (is.na(xx)) 
                f <- "%Y-%m-%d"
        }
        if (is.na(xx) || !is.na(strptime(xx, f <- "%Y-%m-%d", 
            tz = "GMT")) || !is.na(strptime(xx, f <- "%Y/%m/%d", 
            tz = "GMT"))) 
            return(strptime(x, f))
        stop("character string is not in a standard unambiguous format")
    }
    res <- if (missing(format)) 
        charToDate(x)
    else strptime(x, format, tz = "GMT")       ####  slow part, I think  ####
    as.Date(res)
}
<bytecode: 0x2cf6da0>
<environment: namespace:base>
> 

为什么 as.POSIXlt(Date)$year+1900 比较快?再次,追踪它:

Why is as.POSIXlt(Date)$year+1900 relatively fast? Again, trace it through :

> as.POSIXct
function (x, tz = "", ...) 
UseMethod("as.POSIXct")
<bytecode: 0x2936de8>
<environment: namespace:base>

> methods(as.POSIXct)
[1] as.POSIXct.date    as.POSIXct.Date    as.POSIXct.dates   as.POSIXct.default
[5] as.POSIXct.IDate*  as.POSIXct.ITime*  as.POSIXct.numeric as.POSIXct.POSIXlt
   Non-visible functions are asterisked

> as.POSIXlt.Date
function (x, ...) 
{
    y <- .Internal(Date2POSIXlt(x))
    names(y$year) <- names(x)
    y
}
<bytecode: 0x395e328>
<environment: namespace:base>
> 

很感兴趣,让我们深入研究 Date2POSIXlt.对于这一点,我们需要 grep main/src 来知道要查看哪个 .c 文件.

Intrigued, let's dig into Date2POSIXlt. For this bit we need to grep main/src to know which .c file to look at.

~/R/Rtrunk/src/main$ grep Date2POSIXlt *
names.c:{"Date2POSIXlt",do_D2POSIXlt,   0,  11, 1,  {PP_FUNCALL, PREC_FN,   0}},
$

现在我们知道我们需要寻找 D2POSIXlt :

Now we know we need to look for D2POSIXlt :

~/R/Rtrunk/src/main$ grep D2POSIXlt *
datetime.c:SEXP attribute_hidden do_D2POSIXlt(SEXP call, SEXP op, SEXP args, SEXP env)
names.c:{"Date2POSIXlt",do_D2POSIXlt,   0,  11, 1,  {PP_FUNCALL, PREC_FN,   0}},
$

哦,我们可以猜到 datetime.c.无论如何,看看最新的现场副本:

Oh, we could have guessed datetime.c. Anyway, so looking at latest live copy :

datetime.c

在那里搜索 D2POSIXlt,您会发现从 Date(数字)到 POSIXlt 是多么简单.您还将看到 POSIXlt 如何是一个实数向量(8 个字节)加上七个整数向量(每个 4 个字节).每个日期 40 个字节!

Search in there for D2POSIXlt and you'll see how simple it is to go from Date (numeric) to POSIXlt. You'll also see how POSIXlt is one real vector (8 bytes) plus seven integer vectors (4 bytes each). That's 40 bytes, per date!

所以问题的症结(我认为)是为什​​么 strptime 这么慢,也许可以在 R 中改进.或者直接避免 POSIXlt或间接.

So the crux of the issue (I think) is why strptime is so slow, and maybe that can be improved in R. Or just avoid POSIXlt, either directly or indirectly.

这是一个使用相关项目数 (3,000,000) 的可重现示例:

Here's a reproducible example using the number of items stated in question (3,000,000) :

> Range = seq(as.Date("2000-01-01"),as.Date("2012-01-01"),by="days")
> Date = format(sample(Range,3000000,replace=TRUE),"%m/%d/%Y")
> system.time(as.Date(Date, "%m/%d/%Y"))
   user  system elapsed 
 21.681   0.060  21.760 
> system.time(strptime(Date, "%m/%d/%Y"))
   user  system elapsed 
 29.594   8.633  38.270 
> system.time(strptime(Date, "%m/%d/%Y", tz="GMT"))
   user  system elapsed 
 19.785   0.000  19.802 

传递 tz 似乎可以加快 strptime,而 as.Date.character 确实如此.所以也许这取决于你的语言环境.但是 strptime 似乎是罪魁祸首,而不是 data.table.或许重新运行这个例子,看看你的机器上是否需要 90 秒?

Passing tz appears to speed up strptime, which as.Date.character does. So maybe it depends on your locale. But strptime appears to be the culprit, not data.table. Perhaps rerun this example and see if it takes 90 seconds for you on your machine?

这篇关于为什么 as.Date 在字符向量上很慢?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆