How to quickly convert different time formats in large data frames?
Question
I want to calculate length in different time dimensions, but I have problems dealing with the two slightly different time formats in my data frame column.
The original data frame column has about a million rows, with the two formats (shown in the example code) mixed together.
Example code:
time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z",
"2018-10-04T12:13:41.333Z", "2018-10-04T12:13:45.479Z")
length <- c(15.8, 132.1, 12.5, 33.2)
df <- data.frame(time, length)
df$time <- format(as.POSIXlt(strptime(df$time,"%Y-%m-%dT%H:%M:%SZ", tz="")))
df
The values 2018-10-04T12:13:41.333Z and 2018-10-04T12:13:45.479Z result in NA.
Is there a solution that would also be applicable to a big data frame where the two formats are mixed together?
Recommended answer
We may use %OS instead of %S to account for decimals in seconds.
help("strptime")
Specific to R is %OSn, which for output gives the seconds truncated to 0 <= n <= 6 decimal places (and if %OS is not followed by a digit, it uses the setting of getOption("digits.secs"), or if that is unset, n = 0).
For input, %OS reads seconds including any fractional part, which is why one format string matches both kinds of value here.
as.POSIXct(time, format="%Y-%m-%dT%H:%M:%OSZ")
# [1] "2018-07-29 15:02:05 CEST" "2018-07-29 14:46:57 CEST"
# [3] "2018-10-04 12:13:41 CEST" "2018-10-04 12:13:45 CEST"
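The trailing Z in these strings denotes UTC, so it may be worth passing tz explicitly rather than relying on the session's time zone. The following is a minimal sketch, not part of the original answer: the names stamps, parsed, frac, elapsed_days and elapsed_hours are made up for illustration, and tz = "UTC" is an assumption about what the data means. It also shows one way to express the elapsed time in different units with difftime.

```r
# Illustrative data (hypothetical name `stamps`), both formats mixed
stamps <- c("2018-07-29T15:02:05Z", "2018-10-04T12:13:41.333Z")

# %OS reads whole or fractional seconds on input;
# tz = "UTC" is an assumption based on the "Z" suffix
parsed <- as.POSIXct(stamps, format = "%Y-%m-%dT%H:%M:%OSZ", tz = "UTC")

# Check that neither format produced NA
anyNA(parsed)

# The fractional part is kept internally even though printing hides it by default
frac <- as.numeric(parsed[2]) %% 1

# Elapsed time between the two stamps in different units
elapsed_days  <- difftime(parsed[2], parsed[1], units = "days")
elapsed_hours <- difftime(parsed[2], parsed[1], units = "hours")
```

Setting options(digits.secs = 3) would make the printed values show the fractional seconds as well.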
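In the benchmark at the end of this answer, this base R call is roughly ten times faster than anytime; lubridate's parse_date_time is faster still, but both package calls produce NAs on the mixed-format input. Try it yourself.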
time2 <- c("2018-09-01T12:42:37.000+02:00", "2018-10-01T11:42:37.000+03:00")
This one is trickier. ?strptime says we should use %z for offsets from UTC, but it documents offsets written like -0800, and the colon form +02:00 used here did not work with as.POSIXct. Instead we could do this,
off <- substring(time2, 24)                      # trailing offset, e.g. "+02:00"
os <- (as.numeric(substr(off, 1, 3))*60 +        # signed hours, in minutes
       as.numeric(paste0(substr(off, 1, 1), substr(off, 5, 6))))*60  # plus signed minutes, all in seconds
as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS", tz="UTC") - os
# [1] "2018-09-01 10:42:37 UTC" "2018-10-01 08:42:37 UTC"
which cuts the offset from each string, converts it to seconds and subtracts it from the "POSIXct" object. The offset is subtracted because a time stamp written with +02:00 is two hours ahead of UTC, and it is computed per element so that every string gets its own offset.
If the offsets are whole hours, as in time2, we could also say:
as.POSIXct(substr(time2, 1, 23), format="%Y-%m-%dT%H:%M:%OS", tz="UTC") -
  as.numeric(substr(time2, 24, 26))*3600
# [1] "2018-09-01 10:42:37 UTC" "2018-10-01 08:42:37 UTC"
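As an alternative to the substring arithmetic, the colon can be removed from the offset so that the string matches the +0200 form that ?strptime documents for %z. This is a hedged sketch rather than the answer's method: the names stamps_off, no_colon, parsed_z and utc_text are hypothetical, and it assumes the R installation in use parses %z on input, which is worth verifying locally. For reference, 12:42:37+02:00 corresponds to 10:42:37 UTC.

```r
# Illustrative data (hypothetical name `stamps_off`)
stamps_off <- c("2018-09-01T12:42:37.000+02:00", "2018-10-01T11:42:37.000+03:00")

# Rewrite a trailing "+hh:mm" or "-hh:mm" as "+hhmm" / "-hhmm"
no_colon <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", stamps_off)

# %OS reads the (fractional) seconds, %z the offset; the result is expressed in UTC
parsed_z <- as.POSIXct(no_colon, format = "%Y-%m-%dT%H:%M:%OS%z", tz = "UTC")

# Text form of the UTC instants, for inspection
utc_text <- format(parsed_z, "%Y-%m-%d %H:%M:%S")
```

Because sub only touches strings that end in an offset with a colon, strings in other formats pass through unchanged and can be handled separately.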
Although the code is slightly longer now, it runs practically as fast as the one at the top of the answer.
You could wrap the three variants into a function that splits the input by nchar(x) and converts each group with the matching format, such as this one:
fixDateTime <- function(x) {
  n <- nchar(x)
  s <- split(x, n)  # 20: whole seconds + "Z", 24: fractional seconds + "Z", 29: UTC offset
  # independent ifs (not else-if) so that every group present is converted
  if ("20" %in% names(s))
    s$`20` <- as.POSIXct(s$`20`, format="%Y-%m-%dT%H:%M:%SZ", tz="UTC")
  if ("24" %in% names(s))
    s$`24` <- as.POSIXct(s$`24`, format="%Y-%m-%dT%H:%M:%OSZ", tz="UTC")
  if ("29" %in% names(s)) {
    off <- substring(s$`29`, 24)  # per-element offset, e.g. "+03:00"
    os <- (as.numeric(substr(off, 1, 3))*60 +
           as.numeric(paste0(substr(off, 1, 1), substr(off, 5, 6))))*60
    s$`29` <- as.POSIXct(substr(s$`29`, 1, 23), format="%Y-%m-%dT%H:%M:%OS", tz="UTC") - os
  }
  return(unsplit(s, n))
}
res <- fixDateTime(time3)
res
# [1] "2018-07-29 15:02:05 UTC" "2018-10-04 12:13:41 UTC" "2018-10-01 08:42:37 UTC"
str(res)
# POSIXct[1:3], format: "2018-07-29 15:02:05" "2018-10-04 12:13:41" "2018-10-01 08:42:37"
Of the approaches compared here, only fixDateTime handles all three date-time formats defined above; the lubridate and anytime calls in the benchmark produce NAs. In the concluding benchmark the function took about 36 ms (median) for 1,000 strings, compared with about 21 ms for the single as.POSIXct call.
Note: The function logically fails if different date formats have the same nchar, and it should be customized in that case (e.g. by another split condition)! Not tested: daylight saving time behavior when adding seconds to POSIXct.
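One possible "other split condition" is to group the strings by pattern instead of by length. The sketch below is only an illustration under assumptions: classifyDateTime, the label names and the fourth example string are hypothetical and not part of the original answer. It shows a string that has 20 characters, like the whole-second format, but is kept apart because it does not match the expected layout.

```r
# Hypothetical helper: label each string by its layout rather than its length
classifyDateTime <- function(x) {
  base <- "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}"
  kind <- rep("unknown", length(x))
  kind[grepl(paste0(base, "Z$"), x)] <- "wholeZ"
  kind[grepl(paste0(base, "\\.[0-9]+Z$"), x)] <- "fractionalZ"
  kind[grepl(paste0(base, "\\.[0-9]+[+-][0-9]{2}:[0-9]{2}$"), x)] <- "offset"
  kind
}

examples <- c("2018-07-29T15:02:05Z", "2018-10-04T12:13:41.333Z",
              "2018-10-01T11:42:37.000+03:00",
              "2018/07/29 15:02:05Z")  # 20 characters, but a different layout

# Groups that could replace split(x, nchar(x)) inside a function like fixDateTime
groups <- split(examples, classifyDateTime(examples))
```

Strings labelled "unknown" can then be reported or converted separately instead of being parsed with the wrong format.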
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# fixDateTime 35.46387 35.94761 40.07578 36.05923 39.54706 68.46211 10 c
# as.POSIXct 20.32820 20.45985 21.00461 20.62237 21.16019 23.56434 10 b # to compare
# lubridate 11.59311 11.68956 12.88880 12.01077 13.76151 16.54479 10 a # produces NAs!
# anytime 198.57292 201.06483 203.95131 202.91368 203.62130 212.83272 10 d # produces NAs!
Data
time <- c("2018-07-29T15:02:05Z", "2018-07-29T14:46:57Z", "2018-10-04T12:13:41.333Z",
"2018-10-04T12:13:45.479Z")
time2 <- c("2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z", "2018-07-29T15:02:05Z")
time3 <- c("2018-07-29T15:02:05Z", "2018-10-04T12:13:41.333Z",
"2018-10-01T11:42:37.000+03:00")
Benchmark code
n <- 1e3
t1 <- sample(time2, n, replace=TRUE)
t2 <- sample(time3, n, replace=TRUE)
library(lubridate)
library(anytime)
microbenchmark::microbenchmark(fixDateTime=fixDateTime(t2),
as.POSIXct=as.POSIXct(t1, format="%Y-%m-%dT%H:%M:%OSZ"),
lubridate=parse_date_time(t2, "ymd_HMS"),
anytime=anytime(t2),
times=10L)