根据时间序列对数据帧进行子集 [英] Subset a data frame based on a time sequence

查看:74
本文介绍了根据时间序列对数据帧进行子集的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个名为 DF 的数据框,其中包含时间和日期列.我想根据这些列中的值对 DF 进行子集化.对于日期,我在 DATES 中有一个日期列表,并且正在对 DATES 中存在 DF$Date 的 DF 行进行子集化.目前,我想从 00:04:00 到 00:06:00 进行子集化.我不知道如何做后者.

I have a data frame called DF with columns Time and Date. I would like to subset DF based on values in those columns. For the dates, I have a list of dates in DATES, and am subsetting DF rows where DF$Date is present in DATES. For the time, I would like to subset from 00:04:00 to 00:06:00. I do not know how to do the latter.

理想情况下,我想通过指定范围(如 00:04:00 到 00:06:00)以及指定起点和向前看的分钟数(如 00:04:00)来进行子集化和 3 分钟(两种不同的方法).我想这一切都归结为制作一个时间序列,并将这样的序列放在一个单独的向量中,然后我可以将其用于匹配.

Ideally I would like to subset both by specifying the range, as in 00:04:00 to 00:06:00, as well as by specifying the starting point and minutes to look ahead, as in 00:04:00 and 3 minutes (two different methods). I guess it all comes down to making a sequence of time, and placing such sequence in a separate vector, which I can then use for matching.

请注意,这只是一个可重现的简短示例.我正在寻找一种通用的方法来做到这一点,因为在实践中我想对大的时间跨度进行子集化.另请注意,即使在此示例中只有一个匹配日期,但实际上会有许多跨越多年的匹配日期.这就是为什么我认为不可能使用 POSIXlt 来制作时间序列.非常感谢.

Please note this is simply a short reproducible example. I am looking for a general way to do this because in practice I want to subset large spans of time. Please also note that even though in this example there is only one matching date, in practice there would be many matching dates spanning multiple years. That is why I believe it is not possible to use POSIXlt for making the time sequence. Thank you very much.

  #DF looks like this:
  #               DateTime  XXX      Time      Date
  #1371 2016-04-25 00:08:00  14 00:08:00 2016-04-25
  #1372 2016-04-25 00:07:00  13 00:07:00 2016-04-25
  #1373 2016-04-25 00:06:00  14 00:06:00 2016-04-25
  #1374 2016-04-25 00:05:00   3 00:05:00 2016-04-25
  #1375 2016-04-25 00:04:00   2 00:04:00 2016-04-25
  #1376 2016-04-25 00:03:00   4 00:03:00 2016-04-25
  #1377 2016-04-25 00:02:00   6 00:02:00 2016-04-25
  #1387 2016-04-24 23:52:00  41 23:52:00 2016-04-24
  #1388 2016-04-24 23:51:00  93 23:51:00 2016-04-24
  #1389 2016-04-24 23:50:00  53 23:50:00 2016-04-24

  #Code for DF, DATES, and to subset DF based on DATES
  DF <- structure(list(DateTime = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L, 0L, 59L, 58L, 57L, 56L, 55L, 54L, 53L, 52L, 51L, 50L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L, 23L), mday = c(25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 25L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L), mon = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L, 116L), wday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), yday = c(115L, 115L, 115L, 115L, 115L, 115L, 115L, 115L, 115L, 114L, 114L, 114L, 114L, 114L, 114L, 114L, 114L, 114L, 114L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), zone = c("EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT", "EDT"), gmtoff = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec", "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), Open = c(14, 13, 14, 3, 2, 4, 6, 4, 15, 15, 23, 24, 33, 14, 65, 54, 41, 93, 53), Time = c("00:08:00", "00:07:00", "00:06:00", "00:05:00", "00:04:00", "00:03:00", "00:02:00", "00:01:00", "00:00:00", "23:59:00", "23:58:00", "23:57:00", "23:56:00", "23:55:00", "23:54:00", "23:53:00", "23:52:00", "23:51:00", "23:50:00"), Date = structure(c(16916, 16916, 16916, 16916, 16916, 16916, 16916, 16916, 16916, 16915, 16915, 16915, 16915, 16915, 16915, 16915, 16915, 16915, 16915), class = "Date")), .Names = c("DateTime", "XXX", "Time", "Date"), row.names = c("1371", "1372", "1373", "1374", "1375", "1376", "1377", "1378", "1379", "1380", "1381", "1382", "1383", "1384", "1385", "1386", "1387", "1388", "1389"), class = "data.frame")
  DATES <- structure(c(12431, 12432, 10445, 10480, 11487, 12494, 12501, 12508, 13115, 13522, 14529, 15536, 16916, 16935), class = "Date")
  SELEC <- DF[DF$Date %in% DATES,]

  #Result of subsetting by Date:
  #                 DateTime XXX     Time       Date
  # 1371 2016-04-25 00:08:00  14 00:08:00 2016-04-25
  # 1372 2016-04-25 00:07:00  13 00:07:00 2016-04-25
  # 1373 2016-04-25 00:06:00  14 00:06:00 2016-04-25
  # 1374 2016-04-25 00:05:00   3 00:05:00 2016-04-25
  # 1375 2016-04-25 00:04:00   2 00:04:00 2016-04-25
  # 1376 2016-04-25 00:03:00   4 00:03:00 2016-04-25
  # 1377 2016-04-25 00:02:00   6 00:02:00 2016-04-25
  # 1378 2016-04-25 00:01:00   4 00:01:00 2016-04-25
  # 1379 2016-04-25 00:00:00  15 00:00:00 2016-04-25

  #How the final product would look like if using a larger data base spanning many years:
  #           DateTime XXX     Time       Date
  #2016-04-25 00:06:00  13 00:06:00 2016-04-25
  #2016-04-25 00:05:00  14 00:05:00 2016-04-25
  #2016-04-25 00:04:00   3 00:04:00 2016-04-25
  #2014-03-11 00:06:00  94 00:06:00 2014-03-11
  #2014-03-11 00:05:00   6 00:05:00 2014-03-11
  #2014-03-11 00:04:00  14 00:04:00 2014-03-11
  #2011-08-06 00:06:00  13 00:06:00 2011-08-06
  #2011-08-06 00:05:00  19 00:05:00 2011-08-06
  #2011-08-06 00:04:00  41 00:04:00 2011-08-06

推荐答案

可能性 1:字典比较

如果所有时间值都存储为具有相同分隔符的零填充 24 小时字符串,例如 %H:%M:%S,则可以使用字典比较来应用过滤器.

If all time values are stored as zero-padded 24 hour strings with the same delimiters, such as %H:%M:%S, then a lexicographic comparison can be used to apply the filter.

DF[DF$Date%in%DATES & DF$Time>='00:04:00' & DF$Time<='00:06:00',];
##                 DateTime XXX     Time       Date
## 1373 2016-04-25 00:06:00  14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00   3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00   2 00:04:00 2016-04-25

当然,字典解决方案并不理想,因为它们不适合基于时间的数学运算,例如加、减、乘、除等.

Lexicographic solutions are, of course, not ideal, because they do not lend themselves to time-based math, such as adding, subtracting, multiplying, dividing, etc.

更好的解决方案包括将时间值转换为数字类型,该类型将持续时间编码为与显式或未指定基准时间的偏移量.这是流行的日期/时间库编码类型的方式,例如 为 C++ 提升 date_timeJoda-Time 用于 Java,POSIXct、difftime 和 lubridate 用于 R.

Better solutions involve transforming the time values to a numerical type that encodes time durations as an offset from an explicit or unspecified base time. This is how popular date/time libraries encode types, such as boost date_time for C++, Joda-Time for Java, and POSIXct, difftime, and lubridate for R.

可能性 2:手动数字

可以自己解析字符串来构造表示持续时间的数字,并使用数字比较来应用过滤器.

It's possible to parse the strings ourselves to construct numerics representing the time durations, and use numerical comparison to apply the filter.

hmsToDouble <- function(hms) as.double(substr(hms,1,2))*3600 + as.double(substr(hms,4,5))*60 + as.double(substr(hms,7,8));
DF[DF$Date%in%DATES & hmsToDouble(DF$Time)>=hmsToDouble('00:04:00') & hmsToDouble(DF$Time)<=hmsToDouble('00:06:00'),];
##                 DateTime XXX     Time       Date
## 1373 2016-04-25 00:06:00  14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00   3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00   2 00:04:00 2016-04-25

<小时>

可能性 3:POSIXt

我们可以生成 POSIXt(即 POSIXct 或 POSIXlt)值的向量,并对这些向量使用向量化比较.

We can generate vectors of POSIXt (that is, POSIXct or POSIXlt) values and use vectorized comparisons against these vectors.

DF[DF$Date%in%DATES & DF$DateTime>=as.POSIXct(paste0(DF$Date,' 00:04:00')) & DF$DateTime<=as.POSIXct(paste0(DF$Date,' 00:06:00')),];
##                 DateTime XXX     Time       Date
## 1373 2016-04-25 00:06:00  14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00   3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00   2 00:04:00 2016-04-25

<小时>

可能性 4:difftime

R 中唯一的内置持续时间数据类型是 difftime 类型,使用起来可能有点挑剔.但对于这个问题,它相当简单.

The only built-in time duration data type in R is the difftime type, which can be a little bit finicky to work with. But for this problem, it's fairly straightforward.

DF[DF$Date%in%DATES & as.difftime(DF$Time)>=as.difftime('00:04:00') & as.difftime(DF$Time)<=as.difftime('00:06:00'),];
##                 DateTime XXX     Time       Date
## 1373 2016-04-25 00:06:00  14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00   3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00   2 00:04:00 2016-04-25

<小时>

可能性 5:润滑

lubridate 包被广泛认为是最好的R 中用于日期/时间处理的包.它提供了一个表示规则持续时间的持续时间类型,以及一个允许表示各种不规则时间单位计数的周期类型.从历史上看,日期/时间库有时会失败,因为它们缺乏对不规则时间段和规则时间段之间的区别的认识.

The lubridate package is widely considered to be the best package for date/time handling in R. It provides a duration type which represents regular time durations, and a period type which allows representing counts of various irregular time units. Historically, date/time libraries have sometimes failed because they lacked an appreciation for the distinction between irregular time periods and regular time durations.

在以下解决方案中,hms() 调用返回周期类型的实例,因此我们实际上是在比较不同的时间单位.顺便说一句,关于单独时间单位的实际存储,lubridate 的设计是将秒值存储为双向量的实际有效载荷,其余单位(分钟、小时、天、月和年)作为属性存储在对象.

In the following solution, the hms() calls return instances of the period type, hence we are actually comparing separate time units. Incidentally, with respect to the actual storage of the separate time units, lubridate's design is to store the seconds values as the actual payload of the double vector, and the remaining units (minutes, hours, days, months, and years) as attributes on the object.

library(lubridate);
DF[DF$Date%in%DATES & hms(DF$Time)>=hms('00:04:00') & hms(DF$Time)<=hms('00:06:00'),];
##                 DateTime XXX     Time       Date
## 1373 2016-04-25 00:06:00  14 00:06:00 2016-04-25
## 1374 2016-04-25 00:05:00   3 00:05:00 2016-04-25
## 1375 2016-04-25 00:04:00   2 00:04:00 2016-04-25

这篇关于根据时间序列对数据帧进行子集的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆