How to fast convert different time formats in large data frames?

Views: 103  Published: 2020/10/19 1:27:19  Tags: r performance datetime-format


Problem description

I want to calculate lengths across different time dimensions, but I am having trouble with the two slightly different time formats in my data frame column.

The original data frame column has about a million rows, with the two formats (shown in the example code) mixed together.

Example code:

time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z",
         "2018-10-04T12:13:41.333Z", "2018-10-04T12:13:45.479Z")

length <- c(15.8, 132.1, 12.5, 33.2)

df <- data.frame(time, length)

df$time <- format(as.POSIXlt(strptime(df$time,"%Y-%m-%dT%H:%M:%SZ", tz="")))
df

The values 2018-10-04T12:13:41.333Z and 2018-10-04T12:13:45.479Z, which carry fractional seconds, result in NA.

Is there a solution that also works for a big data frame in which the two formats are mixed?

Recommended answer

We may use %OS instead of %S to account for decimals in seconds.

help("strptime")

Specific to R is %OSn, which for output gives the seconds truncated to 0 <= n <= 6 decimal places (and if %OS is not followed by a digit, it uses the setting of getOption("digits.secs"), or if that is unset, n = 0).

On input, %OS also reads seconds that include a fractional part, which is why it parses both formats.

as.POSIXct(time, format="%Y-%m-%dT%H:%M:%OSZ")
# [1] "2018-07-29 15:02:05 CEST" "2018-07-29 14:46:57 CEST"
# [3] "2018-10-04 12:13:41 CEST" "2018-10-04 12:13:45 CEST"
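The output above is shown in the session's local time zone (CEST), although a trailing Z conventionally denotes UTC. As a minimal sketch, assuming the timestamps really are UTC, passing tz = "UTC" keeps the parsed instants independent of the machine's locale. The variable names parsed, frac and elapsed_days are illustrative and not part of the original answer.

```r
time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z",
          "2018-10-04T12:13:41.333Z", "2018-10-04T12:13:45.479Z")

# Assumption: "Z" means UTC, so parse in UTC rather than in local time.
parsed <- as.POSIXct(time, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

# Fractional seconds are stored even though the default print method hides them.
frac <- as.numeric(parsed) %% 1

# Once parsed, differences can be expressed in any unit.
elapsed_days <- difftime(parsed[3], parsed[1], units = "days")
```

Setting options(digits.secs = 3) makes the print method show the fractional part as well.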

This base R code is fast: in the benchmark at the end it is roughly ten times faster than anytime, and although lubridate is quicker, it produces NAs on the mixed data. Try it yourself.

time2 <- c("2018-09-01T12:42:37.000+02:00", "2018-10-01T11:42:37.000+03:00")

This one is trickier. ?strptime says we should use %z for offsets from UTC, but it does not work here with as.POSIXct, probably because the offset is written with a colon (+02:00) whereas %z is documented with the form +0200. Instead we could do this:

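# Note: el() extracts only the first list element, so the offset of
# time2[1] (+02:00) is added to every row.
as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS") + 
  {os <- as.numeric(el(strsplit(substring(time2, 24), "\\:")))
  (os[1]*60 + os[2])*60}
# [1] "2018-09-01 14:42:37 CEST" "2018-10-01 13:42:37 CEST"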

This cuts the part of the string that cannot be parsed, converts it to seconds, and adds it to the "POSIXct" object.

If the offsets consist of whole hours only, as in time2, we could also say:

# substr() is vectorized, so each row gets its own offset (+2 h and +3 h)
as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS") + 
  as.numeric(substr(time2, 24, 26))*3600
# [1] "2018-09-01 14:42:37 CEST" "2018-10-01 14:42:37 CEST"

That the code is slightly longer now should not obscure the fact that it runs practically as fast as the one at the top of the answer.
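The snippets above add the offset to the clock time. Under the ISO 8601 reading, 12:42:37+02:00 is instead a local time two hours ahead of UTC, that is 10:42:37 UTC. A hedged alternative, assuming that reading is the intended one, is to drop the colon from the offset so that %z can consume it. The names no_colon and utc are illustrative only.

```r
time2 <- c("2018-09-01T12:42:37.000+02:00", "2018-10-01T11:42:37.000+03:00")

# Drop the colon inside the trailing offset: "+02:00" -> "+0200".
no_colon <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", time2)

# %z consumes the offset; the result is the corresponding instant in UTC.
utc <- as.POSIXct(no_colon, format = "%Y-%m-%dT%H:%M:%OS%z", tz = "UTC")
format(utc, "%Y-%m-%d %H:%M:%S")
```

Because sub() and as.POSIXct() are both vectorized, each row keeps its own offset.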

You could wrap the three variants into a single function that splits the input by nchar(x) and converts each group with the matching format, such as this one:

fixDateTime <- function(x) {
  s <- split(x, nchar(x))
  # Independent if statements: every format that is present gets converted
  if ("20" %in% names(s))
    s$`20` <- as.POSIXct(s$`20`, format="%Y-%m-%dT%H:%M:%SZ")
  if ("24" %in% names(s))
    s$`24` <- as.POSIXct(s$`24`, format="%Y-%m-%dT%H:%M:%OSZ")
  if ("29" %in% names(s)) {
    # Offset "+hh:mm" in seconds, computed per element and added as above
    sgn <- ifelse(substr(s$`29`, 24, 24) == "-", -1, 1)
    os <- sgn*(as.numeric(substr(s$`29`, 25, 26))*60 +
                 as.numeric(substr(s$`29`, 28, 29)))*60
    s$`29` <- as.POSIXct(substr(s$`29`, 1, 23), format="%Y-%m-%dT%H:%M:%OS") + os
  }
  return(unsplit(s, nchar(x)))
}

res <- fixDateTime(time3)
res
# [1] "2018-07-29 15:02:05 CEST" "2018-10-04 12:13:41 CEST" "2018-10-01 14:42:37 CEST"
str(res)
# POSIXct[1:3], format: "2018-07-29 15:02:05" "2018-10-04 12:13:41" "2018-10-01 14:42:37"

Compared to the packages, only fixDateTime can handle all three defined date-time types. According to the concluding benchmark the function is still very fast.

Note: The function necessarily fails if different date formats have the same nchar; in that case it should be customized (e.g. with another split condition)! Not tested: daylight saving time behavior when adding seconds to POSIXct.
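One way around the nchar limitation is to normalize every string to a single shape and parse once. The following is a minimal sketch, not the answer's method: parse_mixed_iso is a hypothetical helper, and it assumes that Z means UTC and that offsets follow the ISO 8601 convention (local time ahead of UTC for a positive offset).

```r
# Hypothetical helper (not part of the original answer).
parse_mixed_iso <- function(x) {
  x <- sub("Z$", "+0000", x)                           # assumption: Z means UTC
  x <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", x)  # "+03:00" -> "+0300"
  # %OS reads seconds with or without a fractional part; %z reads the offset.
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%OS%z", tz = "UTC")
}

time3 <- c("2018-07-29T15:02:05Z", "2018-10-04T12:13:41.333Z",
           "2018-10-01T11:42:37.000+03:00")
res <- parse_mixed_iso(time3)
format(res, "%Y-%m-%d %H:%M:%S")
```

Strings that match none of the expected shapes come back as NA, which makes unexpected formats easy to spot with anyNA().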

# Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval  cld
# fixDateTime  35.46387  35.94761  40.07578  36.05923  39.54706  68.46211    10   c 
#  as.POSIXct  20.32820  20.45985  21.00461  20.62237  21.16019  23.56434    10  b   # to compare
#   lubridate  11.59311  11.68956  12.88880  12.01077  13.76151  16.54479    10 a    # produces NAs! 
#     anytime 198.57292 201.06483 203.95131 202.91368 203.62130 212.83272    10    d # produces NAs!

Data

time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z", "2018-10-04T12:13:41.333Z", 
"2018-10-04T12:13:45.479Z")
time2 <- c("2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z") 
time3 <- c("2018-07-29T15:02:05Z", "2018-10-04T12:13:41.333Z", 
           "2018-10-01T11:42:37.000+03:00")

Benchmark code

n <-  1e3
t1 <- sample(time2, n, replace=TRUE)
t2 <- sample(time3, n, replace=TRUE)

library(lubridate)
library(anytime)
microbenchmark::microbenchmark(fixDateTime=fixDateTime(t2),
                               as.POSIXct=as.POSIXct(t1, format="%Y-%m-%dT%H:%M:%OSZ"),
                               lubridate=parse_date_time(t2, "ymd_HMS"),
                               anytime=anytime(t2),
                               times=10L)
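The benchmark above uses n = 1e3, while the question concerns about a million rows. A rough way to check how the base R call scales without extra packages is system.time() on a larger sample. This is a sketch under assumptions: the sample size, the seed and the names big, timing and parsed are illustrative, and timings will differ between machines.

```r
set.seed(42)  # illustrative seed so the sample is reproducible
time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z",
          "2018-10-04T12:13:41.333Z", "2018-10-04T12:13:45.479Z")

n <- 1e5  # raise towards 1e6 to approximate the size in the question
big <- sample(time, n, replace = TRUE)

# system.time() reports user, system and elapsed seconds for the expression.
timing <- system.time(
  parsed <- as.POSIXct(big, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")
)
timing[["elapsed"]]
```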
