Efficiently convert a date column in data.table

Question

I have a large data set with many columns containing dates in two different formats:

"1996-01-04" "1996-01-05" "1996-01-08" "1996-01-09" "1996-01-10" "1996-01-11"

"02/01/1996" "03/01/1996" "04/01/1996" "05/01/1996" "08/01/1996" "09/01/1996"

In both cases, the class() is "character". Since the data set has many rows (4.5 million), I am looking for an efficient data.table conversion method. Right now, I use this self-built function:

# Try DD/MM/YYYY first, then R's default date formats; return the
# input unchanged if neither attempt parses every element.
convert_to_date <- function(in_array) {
  tmp <- try(as.Date(in_array, format = "%d/%m/%Y"), silent = TRUE)
  if (!inherits(tmp, "try-error") && !anyNA(tmp)) {
    return(tmp)
  }
  tmp2 <- try(as.Date(in_array), silent = TRUE)
  if (!inherits(tmp2, "try-error") && !anyNA(tmp2)) {
    return(tmp2)
  }
  in_array
}

I then use it to convert the columns I need (of data.table DF) with

DF[,date:=convert_to_date(date)]

This is, however, still incredibly slow (nearly 45s per column).

Is there any way of optimising this via data.table methods? So far I have not found a better way, so I would be thankful for any tips.

P.S.: For better readability, I have 'outsourced' the function to a second file and source it in my main routine. Does that have a significant (negative) impact on computation speed in R?

Recommended answer

According to this benchmark, the fastest method to convert character dates in standard unambiguous format (YYYY-MM-DD) into class Date is to use as.Date(fasttime::fastPOSIXct()).
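As a minimal sketch of that call on a small vector (assuming the fasttime package is installed; the explicit tz = "UTC" arguments are only there to make the time zone handling visible and are not required by the approach):

```r
# Minimal sketch; assumes the fasttime package is installed.
x <- c("1996-01-04", "1996-01-05", "1996-01-08")

# fastPOSIXct() parses the strings as UTC date-times;
# as.Date() then drops the time-of-day part.
d <- as.Date(fasttime::fastPOSIXct(x, tz = "UTC"), tz = "UTC")

class(d)
d
```

The intermediate POSIXct vector is discarded right away, so only the Date result is kept in the column.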

Unfortunately, this requires testing the format beforehand, because your other format, DD/MM/YYYY, is misinterpreted by fasttime::fastPOSIXct().
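One way such a format test could look is sketched below in base R. The helper name is_iso_date is hypothetical (it is not part of fasttime or data.table), and it checks every non-missing element, which is safer but slower than looking at a single row:

```r
# Hypothetical helper (not part of any package): TRUE only if every
# non-missing element looks like YYYY-MM-DD.
is_iso_date <- function(x) {
  x <- x[!is.na(x)]
  length(x) > 0 && all(grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x))
}

iso   <- c("1996-01-04", NA, "1996-01-08")
slash <- c("02/01/1996", "03/01/1996")

is_iso_date(iso)    # TRUE
is_iso_date(slash)  # FALSE
```

A column would only be handed to the fast converter when the check returns TRUE; anything else needs a converter that is told the format explicitly.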

So, if you don't want to bother about the format of each date column you may use the anytime::anydate() function:

# sample data
df <- data.frame(
    X1 = c("1996-01-04", "1996-01-05", "1996-01-08", "1996-01-09", "1996-01-10", "1996-01-11"), 
    X2 = c("02/01/1996", "03/01/1996", "04/01/1996", "05/01/1996", "08/01/1996", "09/01/1996"), 
    stringsAsFactors = FALSE)

library(data.table)
# convert date columns
date_cols <- c("X1", "X2")
setDT(df)[, (date_cols) := lapply(.SD, anytime::anydate), .SDcols = date_cols]
df



           X1         X2
1: 1996-01-04 1996-02-01
2: 1996-01-05 1996-03-01
3: 1996-01-08 1996-04-01
4: 1996-01-09 1996-05-01
5: 1996-01-10 1996-08-01
6: 1996-01-11 1996-09-01

The benchmark timings show that there is a trade-off between the convenience offered by the anytime package and performance. Note also that, in the output above, anydate() read the X2 values as MM/DD/YYYY rather than DD/MM/YYYY. So if speed is crucial, there is no way around testing the format of each column and using the fastest conversion method available for that format.

The OP has used the try() function for this purpose. The solution below uses regular expressions to find all columns which match a given format (only row 1 is inspected, to save time). This has the additional benefit that the names of the relevant columns are determined automatically and need not be typed in.

# enhanced sample data with additional columns
df <- data.frame(
    X1 = c("1996-01-04", "1996-01-05", "1996-01-08", "1996-01-09", "1996-01-10", "1996-01-11"), 
    X2 = c("02/01/1996", "03/01/1996", "04/01/1996", "05/01/1996", "08/01/1996", "09/01/1996"), 
    X3 = "other data",
    X4 = 1:6,
    stringsAsFactors = FALSE)

library(data.table)
options(datatable.print.class = TRUE)

# coerce to data.table
setDT(df)[]
# convert date columns in standard unambiguous format YYYY-MM-DD
date_cols1 <- na.omit(names(df)[
  df[1, sapply(.SD, stringr::str_detect, pattern = "\\d{4}-\\d{2}-\\d{2}"),]])
# use fasttime package
df[, (date_cols1) := lapply(.SD, function(x) as.Date(fasttime::fastPOSIXct(x))), 
   .SDcols = date_cols1]
# convert date columns in DD/MM/YYYY format
date_cols2 <- na.omit(names(df)[
  df[1, sapply(.SD, stringr::str_detect, pattern = "\\d{2}/\\d{2}/\\d{4}"),]])
# use lubridate package
df[, (date_cols2) := lapply(.SD, lubridate::dmy), .SDcols = date_cols2]
df



           X1         X2         X3    X4
       <Date>     <Date>     <char> <int>
1: 1996-01-04 1996-01-02 other data     1
2: 1996-01-05 1996-01-03 other data     2
3: 1996-01-08 1996-01-04 other data     3
4: 1996-01-09 1996-01-05 other data     4
5: 1996-01-10 1996-01-08 other data     5
6: 1996-01-11 1996-01-09 other data     6



Caveat

If one of the date columns contains NA in the first row, that column may escape conversion. To handle such cases, the above code needs to be amended.
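One possible amendment, sketched under assumptions, is to probe the first non-missing value of each column instead of row 1. The helpers first_non_na and cols_matching are hypothetical names introduced here for illustration, and base as.Date() with an explicit format stands in for the faster converters so that the sketch only depends on data.table:

```r
library(data.table)

# Hypothetical helper: first non-missing value of a column, as character.
first_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x)) as.character(x[1L]) else NA_character_
}

# Hypothetical helper: names of the columns whose first non-missing
# value matches the regular expression `pattern`.
cols_matching <- function(dt, pattern) {
  probe <- vapply(dt, first_non_na, character(1L))
  names(dt)[!is.na(probe) & grepl(pattern, probe)]
}

# Sample data with NA in the first row of both date columns
dt <- data.table(
  X1 = c(NA, "1996-01-05", "1996-01-08"),
  X2 = c(NA, "03/01/1996", "04/01/1996"),
  X3 = "other data",
  X4 = 1:3
)

iso_cols <- cols_matching(dt, "^[0-9]{4}-[0-9]{2}-[0-9]{2}$")
dmy_cols <- cols_matching(dt, "^[0-9]{2}/[0-9]{2}/[0-9]{4}$")

# Base as.Date() with explicit formats keeps the sketch dependency-light;
# the faster converters can be swapped in where speed matters.
dt[, (iso_cols) := lapply(.SD, as.Date, format = "%Y-%m-%d"), .SDcols = iso_cols]
dt[, (dmy_cols) := lapply(.SD, as.Date, format = "%d/%m/%Y"), .SDcols = dmy_cols]
dt
```

Like the original code, this still trusts a single value per column, so a column with mixed formats would be converted using whichever format its first non-missing value happens to have.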
