Merge Records Over Time Interval


Problem description

Let me begin by saying this question pertains to R (the statistical programming language), but I'm open to straightforward suggestions for other environments.

The goal is to merge outcomes from dataframe (df) A to sub-elements in df B. This is a one-to-many relationship but, here's the twist: once the records are matched by keys, they also have to match over a specific time frame given by a start time and duration.

For example, a few records in df A:

    OBS ID StartTime Duration Outcome 
    1   01 10:12:06  00:00:10 Normal
    2   02 10:12:30  00:00:30 Weird
    3   01 10:15:12  00:01:15 Normal
    4   02 10:45:00  00:00:02 Normal

And from df B:

    OBS ID Time       
    1   01 10:12:10  
    2   01 10:12:17  
    3   02 10:12:45  
    4   01 10:13:00  

The desired result of the merge would be:

    OBS ID Time     Outcome  
    1   01 10:12:10 Normal 
    3   02 10:12:45 Weird 

Desired result: dataframe B with outcomes merged in from A. Notice that observations 2 and 4 were dropped because, although they matched IDs on records in A, they did not fall within any of the time intervals given.

Question

Is it possible to perform this sort of operation in R, and how would you get started? If not, can you suggest an alternative tool?

Recommended answer

Setting up the data

First set up the input data frames. We create two versions of the data frames: A and B just use character columns for the times, while At and Bt use the chron package's "times" class for the times (which has the advantage over the "character" class that values can be added and subtracted):

LinesA <- "OBS ID StartTime Duration Outcome 
    1   01 10:12:06  00:00:10 Normal
    2   02 10:12:30  00:00:30 Weird
    3   01 10:15:12  00:01:15 Normal
    4   02 10:45:00  00:00:02 Normal"

LinesB <- "OBS ID Time       
    1   01 10:12:10  
    2   01 10:12:17  
    3   02 10:12:45  
    4   01 10:13:00"

A <- At <- read.table(textConnection(LinesA), header = TRUE, 
               colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE, 
               colClasses = c("numeric", rep("character", 2)))

# in At and Bt convert times columns to "times" class

library(chron) 

At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)
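
As a small illustration of why the "times" class helps, the sketch below adds a duration to a start time and checks whether two probe times fall inside the resulting interval. It assumes the chron package is installed; the variable names (start_time, duration, end_time, probe, inside) are made up for this example and do not appear in the original answer.

```r
library(chron)

# Values taken from the first record of df A in the question.
start_time <- times("10:12:06")
duration   <- times("00:00:10")

# "times" objects support arithmetic, so the interval end is a plain sum.
end_time <- start_time + duration

# Two times from df B with the same ID.
probe <- times(c("10:12:10", "10:12:17"))

# Element-wise interval test: is each probe within [start_time, end_time]?
inside <- probe >= start_time & probe <= end_time
print(inside)
```

With plain character columns the addition would fail, which is why the character-based approach later has to convert the strings to seconds inside SQL.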

sqldf with the times class

Now we can perform the calculation using the sqldf package. We use method="raw" (which does not assign classes to the output), so we must assign the "times" class to the output "Time" column ourselves:

library(sqldf)

out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
   where Time between StartTime and StartTime + Duration",
   method = "raw")

out$Time <- times(as.numeric(out$Time))

The result is:

> out
  OBS ID     Time Outcome
1   1 01 10:12:10  Normal
2   3 02 10:12:45   Weird
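
For readers without sqldf, the same join-then-filter idea can be sketched in base R. This is not the approach from the answer, only an alternative under stated assumptions: the times are kept as "HH:MM:SS" character strings and converted to seconds by to_seconds, a hypothetical helper written for this example. The data frames are rebuilt inline so the block runs on its own.

```r
# Hypothetical helper (not from the original answer): "HH:MM:SS" -> seconds.
to_seconds <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE),
         function(p) sum(as.numeric(p) * c(3600, 60, 1)),
         numeric(1))
}

A <- data.frame(
  OBS = 1:4,
  ID = c("01", "02", "01", "02"),
  StartTime = c("10:12:06", "10:12:30", "10:15:12", "10:45:00"),
  Duration = c("00:00:10", "00:00:30", "00:01:15", "00:00:02"),
  Outcome = c("Normal", "Weird", "Normal", "Normal"),
  stringsAsFactors = FALSE
)
B <- data.frame(
  OBS = 1:4,
  ID = c("01", "01", "02", "01"),
  Time = c("10:12:10", "10:12:17", "10:12:45", "10:13:00"),
  stringsAsFactors = FALSE
)

# One-to-many join on ID; A's OBS column is dropped so B's OBS is kept.
joined <- merge(B, A[, c("ID", "StartTime", "Duration", "Outcome")], by = "ID")

# Keep only rows whose Time lies in [StartTime, StartTime + Duration].
start <- to_seconds(joined$StartTime)
t     <- to_seconds(joined$Time)
keep  <- t >= start & t <= start + to_seconds(joined$Duration)

result <- joined[keep, c("OBS", "ID", "Time", "Outcome")]
result <- result[order(result$OBS), ]
rownames(result) <- NULL
print(result)
```

Note that merge builds every ID-matched pair before filtering, so on large inputs the intermediate table can grow much larger than either data frame.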

With the development version of sqldf this can be done without using method="raw", and the "Time" column will automatically be set to the "times" class by the sqldf class assignment heuristic:

library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver 
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
    where Time between StartTime and StartTime + Duration")

sqldf with the character class

It's actually possible to avoid the "times" class altogether by performing all time calculations in SQLite directly on the character strings, using SQLite's strftime function. The SQL statement is unfortunately a bit more involved:

sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
    where strftime('%s', Time) - strftime('%s', StartTime)
       between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")
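
To see what the duration expression in that statement evaluates to, the sketch below runs it on a single literal value. It assumes sqldf with its default SQLite backend is installed; the column alias secs and the variable res are names chosen for this example. SQLite's strftime('%s', ...) returns seconds since the Unix epoch, and a time-only string is interpreted on a fixed default date, so subtracting the value for '00:00:00' leaves just the number of seconds in the duration.

```r
library(sqldf)

# Convert the duration '00:00:30' (second record of df A) to seconds by
# subtracting the epoch seconds of midnight on the same default date.
res <- sqldf("select strftime('%s', '00:00:30') - strftime('%s', '00:00:00') as secs")
print(res)
```

The same subtraction applied to Time and StartTime gives the elapsed seconds between them, which is what the where clause compares against the duration.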

A series of edits fixed the grammar, added additional approaches, and fixed/improved the read.table statements.

Simplified/improved the final sqldf statement.
