R: extract time components from semi-standard strings


Problem description

I have a column of durations stored as strings in a data frame. I want to convert them to an appropriate time object, probably POSIXlt. Most of the strings are easy to parse using this method:

> data <- data.frame(time.string = c(
+   "1 d 2 h 3 m 4 s",
+   "10 d 20 h 30 m 40 s",
+   "--"))
> data$time.span <- strptime(data$time.string, "%j d %H h %M m %S s")
> data$time.span
[1] "2012-01-01 02:03:04" "2012-01-10 20:30:40" NA

Missing durations are coded "--" and need to be converted to NA. This already happens, and the behavior should be preserved.

The challenge is that the strings drop zero-valued elements. Thus the desired value 2012-01-01 02:00:14 is written as the string "1 d 2 h 14 s". However, this string parses to NA with the simple parser, because the fixed format still expects the "m" component:

> data2 <- data.frame(time.string = c(
+  "1 d 2 h 14 s",
+  "10 d 20 h 30 m 40 s",
+  "--"))
> data2$time.span <- strptime(data2$time.string, "%j d %H h %M m %S s")
> data2$time.span
[1] NA "2012-01-10 20:30:40" NA

Questions

  1. What is the "R Way" to handle all the possible string formats? Perhaps test for and extract each element individually, then recombine?
  2. Is POSIXlt the right target class? I need durations that are free from any specific start time, so the addition of false year and month data (2012-01-) is troubling.

Solution

@mplourde definitely had the right idea with the dynamic creation of a format string based on testing which components are present in the input. Using cut(Sys.Date(), breaks='years') as the baseline for the time difference was also good, but it failed to account for a critical quirk in as.POSIXct(). Note: I'm using base R 2.11; this may have been fixed in later versions.

The output of as.POSIXct() changes dramatically depending on whether or not a day component is included:

> x <- "1 d 1 h 14 m 1 s"
> y <-     "1 h 14 m 1 s"  # Same string, no date component
> fmt.x <- "%j d %H h %M m %S s"  # format built for x, as specified below
> fmt.y <- "%H h %M m %S s"       # format built for y
> as.POSIXct(x, format=fmt.x)  # With a day component, the baseline is the start of the year
[1] "2012-01-01 01:14:01 EST"
> as.POSIXct(y, format=fmt.y)  # Without one, the baseline is the start of today
[1] "2012-06-26 01:14:01 EDT"
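The quirk comes from how strptime() fills in fields that the format does not supply: an unspecified year, month or day is taken from the current date. The following is a minimal sketch of that behavior, not part of the original answer; the variable names are made up for illustration.

```r
# Only a time of day is given: the missing date is filled in with today's date.
time_only <- strptime("01:14:01", format = "%H:%M:%S")

# Only a day of the year is given: the missing year is filled in with the
# current year, and the time of day defaults to midnight.
day_only <- strptime("1 d", format = "%j d")

# TRUE unless the clock passes midnight between the two calls.
print(format(time_only, "%Y-%m-%d") == format(Sys.Date(), "%Y-%m-%d"))

# Month, day and time of the day-of-year parse.
print(format(day_only, "%m-%d %H:%M:%S"))
```

This is why a string with a day component is anchored at the start of the year, while a string without one is anchored at the start of the current day.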

Thus the second argument for the difftime function should be:

  • The start of the first day of the current year if the input string has a day component
  • The start of the current day if the input string does not have a day component

This can be accomplished by changing the breaks argument passed to the cut function:

parse.time <- function (x) {
  x <- as.character (x)
  break.unit <- ifelse(grepl("d",x),"years","days")  # chooses cut() unit
  format <- paste(c(if (grepl("d", x)) "%j d",
                    if (grepl("h", x)) "%H h",
                    if (grepl("m", x)) "%M m",
                    if (grepl("s", x)) "%S s"), collapse=" ")

  if (nchar(format) > 0) {
    difftime(as.POSIXct(x, format=format), 
             cut(Sys.Date(), breaks=break.unit),
             units="hours")
  } else {NA}

}

Recommended answer

difftime objects represent time durations and can be added to either POSIXct or POSIXlt objects. Maybe you want to use difftime instead of POSIXlt?
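As a small illustration of that point (a sketch with made-up values, not taken from the question's data), a difftime carries no start time of its own and only becomes a calendar time once it is added to one:

```r
# A duration of 1 day and 2 hours, expressed in hours.
span <- as.difftime(26, units = "hours")

# An arbitrary start time, chosen only for this example.
start <- as.POSIXct("2012-01-01 00:00:00", tz = "UTC")

# Adding the duration to a POSIXct value gives another POSIXct value.
finish <- start + span
print(finish)

# The same duration can be read back in other units.
print(as.numeric(span, units = "mins"))
```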

Regarding the conversion from strings to time objects, you could do something like this:

data <- data.frame(time.string = c(
    "1 d 1 h",
    "30 m 10 s",
    "1 d 2 h 3 m 4 s",
    "2 h 3 m 4 s",
    "10 d 20 h 30 m 40 s",
    "--"))

f <- function(x) {
    x <- as.character(x)
    format <- paste(c(if (grepl('d', x)) '%j d',
                      if (grepl('h', x)) '%H h',
                      if (grepl('m', x)) '%M m',
                      if (grepl('s', x)) '%S s'), collapse=' ')

    if (nchar(format) > 0) {
        if (grepl('%j d', format)) {
            # '%j 1' is day 0. We add a day so that x = '1 d' means 24hrs.
            difftime(as.POSIXct(x, format=format) + as.difftime(1, units='days'), 
                    cut(Sys.Date(), breaks='years'),
                    units='hours')
        } else {
            as.difftime(x, format, units='hours')
        }
    } else { NA }
}

data$time.span <- sapply(data$time.string, FUN=f)
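An alternative that avoids calendar baselines altogether is to follow the idea raised in the question: extract each component individually, then recombine. The sketch below is not from either answer. The function name parse_duration and its units argument are hypothetical, and it assumes that "1 d" means 24 hours and that components are written as a number followed by one of the letters d, h, m or s.

```r
# Hypothetical helper: parse strings such as "1 d 2 h 14 s" into a difftime
# without going through a calendar date.
parse_duration <- function(x, units = "hours") {
  x <- as.character(x)
  # Number of seconds represented by each unit letter.
  secs_per_unit <- c(d = 86400, h = 3600, m = 60, s = 1)

  # Extract the number in front of one unit letter for every element of x.
  # A component that is absent from a string counts as 0.
  component <- function(unit) {
    pattern <- paste0("(^|[[:space:]])([0-9]+)[[:space:]]*", unit,
                      "([[:space:]]|$)")
    hits <- regmatches(x, regexec(pattern, x))
    vapply(hits,
           function(m) if (length(m) > 0) as.numeric(m[3]) else 0,
           numeric(1))
  }

  # Recombine the components as a total number of seconds.
  total <- numeric(length(x))
  for (u in names(secs_per_unit)) {
    total <- total + component(u) * secs_per_unit[[u]]
  }

  # Strings with no recognizable component (such as "--") become NA.
  total[!grepl("[0-9]+[[:space:]]*[dhms]([[:space:]]|$)", x)] <- NA

  out <- as.difftime(total, units = "secs")
  units(out) <- units
  out
}

durations <- parse_duration(c("1 d 2 h 14 s", "10 d 20 h 30 m 40 s", "--"))
print(durations)
```

Because the function works on the whole vector at once, it can be assigned to a data frame column directly, and the result keeps its difftime class instead of being reduced to plain numbers by sapply.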
