How to merge two data frames in R by a common column with mismatched date/time values

Problem description

I wish to merge two datasets using ‘date time’ columns that are present in both (POSIXct format: dd/mm/yyyy hh:mm). Below is example data from the two datasets:

# Dataset 1 (dts1)

           datetime   count   period   
1  30/03/2011 02:32      27      561
2  30/03/2011 02:42       3      600
3  30/03/2011 02:52       0      574
4  30/03/2011 03:02       1      550
5  30/03/2011 03:12      15      600
6  30/03/2011 03:22       0      597

# Dataset 2 (dts2)

   datetime         dist car   satd      alt
1 30/03/2011 01:59  23.9   1      3     1.76       
2 30/03/2011 02:58  14.7   1      7     6.36       
3 30/03/2011 03:55  10.4   2      9    -0.34      
4 30/03/2011 04:53  35.4   1      3     3.55      
5 30/03/2011 05:52  56.1   1      7    -0.91       
6 30/03/2011 06:48  12.3   1      4     6.58      
7 30/03/2011 07:48  10.7   1      5     4.18      

If it was a simple case of merging matching rows from the two frames then basic merge(data1, data2, by="datetime") or rbind() function could be used.
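
As a small illustration of why an exact-key merge is not enough here, the sketch below merges two tiny frames built from a few of the sample timestamps. The frames `a` and `b` are made up for this example and keep the timestamps as plain character strings; because no string appears in both frames, the merge comes back empty.

```r
# Two tiny illustrative frames using timestamps from the sample data.
a <- data.frame(datetime = c("30/03/2011 02:52", "30/03/2011 03:02"),
                count = c(0L, 1L))
b <- data.frame(datetime = c("30/03/2011 02:58", "30/03/2011 03:55"),
                dist = c(14.7, 10.4))

# An exact-match merge keeps only rows whose key occurs in both frames.
exact <- merge(a, b, by = "datetime")
nrow(exact)  # 0
```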

However, my problem is more complicated as the time intervals in the two datasets are not equal. Dataset 1 contains data at precise 10-minute intervals (each row incorporates information on the 10-minute block that ends at the specified date/time), while dataset 2 contains data at approximate 1-hour intervals (each row incorporates information from the 1-hour block that ends at the specified date/time).

To make things harder, there is a time mismatch between the start times of the rows in the two datasets (e.g. dts1: 01/03/2013 10:00:00, dts2: 01/03/2012 09:58:12). The dts2 intervals also vary throughout the dataset (by a few minutes either side of 1 hour). I want to link every 10-minute data row in dataset 1 with the hour block in dataset 2 that it falls within (along with all the associated column values from dts2). Some rows could fit into two different hour blocks (e.g. 30/03/2011 03:02), but I only need each of these rows to be linked to one of the hour blocks.

I would like to end up with something like this:

        datetime_dts1 count period     datetime2_dts2  dist  car satd      alt  
1    30/03/2011 02:32    27    561   30/03/2011 02:58  14.7   1     7     6.36      
2    30/03/2011 02:42     3    600   30/03/2011 02:58  14.7   1     7     6.36
3    30/03/2011 02:52     0    574   30/03/2011 02:58  14.7   1     7     6.36
4    30/03/2011 03:02     1    550   30/03/2011 02:58  14.7   1     7     6.36
5    30/03/2011 03:12    15    600   30/03/2011 03:55  10.4   2     9    -0.34
6    30/03/2011 03:22     0    597   30/03/2011 03:55  10.4   2     9    -0.34

I have searched for an answer to this problem but have not been able to solve it and my R abilities are not advanced. If anyone could give me a direction or provide a solution, I would be extremely grateful.

Recommended answer

After first converting your datetime character strings to POSIXt classes, some combination of rounding and truncating those times should get you something you can use as the basis of a merge.

First read in your data, and create corresponding POSIXt datetimes:

dts1 <- structure(list(datetime = structure(1:6,
   .Label = c("30/03/2011 02:32", "30/03/2011 02:42",
   "30/03/2011 02:52", "30/03/2011 03:02", "30/03/2011 03:12",
   "30/03/2011 03:22"), class = "factor"), count = c(27L, 3L,
   0L, 1L, 15L, 0L), period = c(561L, 600L, 574L, 550L, 600L,
   597L)), .Names = c("datetime", "count", "period"),
   class = "data.frame", row.names = c(NA, -6L))
dts2 <- structure(list(datetime = structure(1:7,
    .Label = c("30/03/2011 01:59", "30/03/2011 02:58",
    "30/03/2011 03:55", "30/03/2011 04:53", "30/03/2011 05:52",
    "30/03/2011 06:48", "30/03/2011 07:48"), class = "factor"),
    dist = c(23.9, 14.7, 10.4, 35.4, 56.1, 12.3, 10.7), car =
    c(1L, 1L, 2L, 1L, 1L, 1L, 1L), satd = c(3L, 7L, 9L, 3L, 7L,
    4L, 5L), alt = c(1.76, 6.36, -0.34, 3.55, -0.91, 6.58,
    4.18)), .Names = c("datetime", "dist", "car", "satd",
    "alt"), class = "data.frame", row.names = c(NA, -7L))

# create corresponding POSIXlt vector
# (you could update the 'datetime' columns in-place if you prefer)
datetime1 <- strptime(dts1$datetime, format="%d/%m/%Y %H:%M")
datetime2 <- strptime(dts2$datetime, format="%d/%m/%Y %H:%M")
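
Before using rounded or truncated times as merge keys, it can help to check what the two operations do to a single value. The sketch below parses one timestamp from dts1 with the same format string; the `tz = "UTC"` argument is an assumption added here only to keep the result reproducible.

```r
# One timestamp from dts1, parsed with the same format string as above.
# tz = "UTC" is an assumption made for reproducibility.
x <- strptime("30/03/2011 02:32", format = "%d/%m/%Y %H:%M", tz = "UTC")

rounded   <- round(x, "hours")  # nearest hour
truncated <- trunc(x, "hours")  # minutes and seconds dropped

format(rounded, "%H:%M")    # "03:00"
format(truncated, "%H:%M")  # "02:00"
```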

The following code produces a merged table based on the nearest hour in all cases. Inside the merge it's just prepending a column with the rounded times to each of your data frames, merging based on that (i.e., column number 1), then using the -1 index to remove that column at the end:

# merge based on nearest hour
merge(
    cbind(round(datetime1, "hours"), dts1),
    cbind(round(datetime2, "hours"), dts2),
    by=1, suffixes=c("_dts1", "_dts2")
)[-1]

     datetime_dts1 count period    datetime_dts2 dist car satd  alt
1 30/03/2011 02:32    27    561 30/03/2011 02:58 14.7   1    7 6.36
2 30/03/2011 02:42     3    600 30/03/2011 02:58 14.7   1    7 6.36
3 30/03/2011 02:52     0    574 30/03/2011 02:58 14.7   1    7 6.36
4 30/03/2011 03:02     1    550 30/03/2011 02:58 14.7   1    7 6.36
5 30/03/2011 03:12    15    600 30/03/2011 02:58 14.7   1    7 6.36
6 30/03/2011 03:22     0    597 30/03/2011 02:58 14.7   1    7 6.36

As above, but this time just truncating on hour:

merge(
    cbind(trunc(datetime1, "hours"), dts1),
    cbind(trunc(datetime2, "hours"), dts2),
    by=1, suffixes=c("_dts1", "_dts2")
)[-1]

     datetime_dts1 count period    datetime_dts2 dist car satd   alt
1 30/03/2011 02:32    27    561 30/03/2011 02:58 14.7   1    7  6.36
2 30/03/2011 02:42     3    600 30/03/2011 02:58 14.7   1    7  6.36
3 30/03/2011 02:52     0    574 30/03/2011 02:58 14.7   1    7  6.36
4 30/03/2011 03:02     1    550 30/03/2011 03:55 10.4   2    9 -0.34
5 30/03/2011 03:12    15    600 30/03/2011 03:55 10.4   2    9 -0.34
6 30/03/2011 03:22     0    597 30/03/2011 03:55 10.4   2    9 -0.34

As above, but for dts1 treat the record as belonging to previous hour until 10 minutes past the hour, by subtracting 10*60 seconds before truncating. This one produces the same output you specified, though without more information I'm not sure that it's the exact rule you want.

merge(
    cbind(trunc(datetime1 - 10*60, "hours"), dts1),
    cbind(trunc(datetime2, "hours"), dts2),
    by=1, suffixes=c("_dts1", "_dts2")
)[-1]

     datetime_dts1 count period    datetime_dts2 dist car satd   alt
1 30/03/2011 02:32    27    561 30/03/2011 02:58 14.7   1    7  6.36
2 30/03/2011 02:42     3    600 30/03/2011 02:58 14.7   1    7  6.36
3 30/03/2011 02:52     0    574 30/03/2011 02:58 14.7   1    7  6.36
4 30/03/2011 03:02     1    550 30/03/2011 02:58 14.7   1    7  6.36
5 30/03/2011 03:12    15    600 30/03/2011 03:55 10.4   2    9 -0.34
6 30/03/2011 03:22     0    597 30/03/2011 03:55 10.4   2    9 -0.34

You could tweak the details of which ones you round, which ones you truncate, and whether you first subtract/add some time depending on your specific rule.
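
One way to make that tweaking easier is to wrap the "shift, then truncate" step in a small function. The helper `hour_key()` and its `shift_secs` parameter below are hypothetical names introduced for this sketch, not part of the original answer; the frames `a` and `b` are cut-down stand-ins for dts1 and dts2, and `tz = "UTC"` is an assumption.

```r
# Hypothetical helper: shift each time by `shift_secs` seconds, truncate to
# the hour, and return the result as a character key suitable for merge().
hour_key <- function(x, shift_secs = 0) {
  x <- as.POSIXct(x)
  format(trunc(x + shift_secs, "hours"), "%Y-%m-%d %H:00")
}

t1 <- as.POSIXct(c("2011-03-30 02:52", "2011-03-30 03:02", "2011-03-30 03:12"),
                 tz = "UTC")
t2 <- as.POSIXct(c("2011-03-30 02:58", "2011-03-30 03:55"), tz = "UTC")

# dts1-style times are shifted back 10 minutes before truncating;
# dts2-style times are truncated as they are.
a <- data.frame(key = hour_key(t1, shift_secs = -10 * 60),
                count = c(0L, 1L, 15L))
b <- data.frame(key = hour_key(t2), dist = c(14.7, 10.4))

merged <- merge(a, b, by = "key")
```

Changing `shift_secs` for either side is then the only edit needed to try a different rule.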

Not the most elegant, but here is a different approach that accommodates the more complicated conditional rule you described in your comments. This leans heavily on na.locf from the zoo package to first determine which dts2 times come before and after each dts1 record. With those in hand, it's just a matter of applying the rule to select the desired dts2 time, matching back to the original dts1 table, then merging.

library(zoo)

# create ordered list of all datetimes, using names to keep
# track of which ones come from each data frame
alldts <- sort(c(
    setNames(datetime1, rep("dts1", length(datetime1))),
    setNames(datetime2, rep("dts2", length(datetime2)))))
is.dts1 <- names(alldts)=="dts1"

# for each dts1 record, get previous closest dts2 time
dts2.prev <- alldts
dts2.prev[is.dts1] <- NA
dts2.prev <- na.locf(dts2.prev, na.rm=FALSE)[is.dts1]

# for each dts1 record, get next closest dts2 time
dts2.next <- alldts
dts2.next[is.dts1] <- NA
dts2.next <- na.locf(dts2.next, na.rm=FALSE, fromLast=TRUE)[is.dts1]

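# for each dts1 record, apply rule to choose dts2 time: use the previous
# dts2 time if it is less than 5 minutes earlier (the threshold is assumed
# to be in minutes), otherwise use the next one; units are set explicitly
# so the comparison does not depend on the units difftime picks by default
use.prev <- !is.na(dts2.prev) &
    (difftime(alldts[is.dts1], dts2.prev, units="mins") < 5)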
dts2.to.use <- ifelse(use.prev, as.character(dts2.prev), 
    as.character(dts2.next))

# merge based on chosen dts2 times, prepended as character vector
# for the purpose of merging
merge(
    cbind(.dt=dts2.to.use[match(datetime1, alldts[is.dts1])], dts1),
    cbind(.dt=as.character(datetime2), dts2),
    by=".dt", all.x=TRUE, suffixes=c("_dts1", "_dts2")
)[-1]
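
For comparison, a similar "previous block if under 5 minutes past it, otherwise the next block" rule can be sketched in base R with `findInterval()` instead of `zoo::na.locf`. This is an alternative technique, not the approach used above. The 5-minute tolerance, the variable names (`t1`, `t2`, `tol`, `idx`, `linked`), and `tz = "UTC"` are assumptions made for illustration, and `t2` must be sorted in increasing order.

```r
# Sample times from the question; tz = "UTC" is an assumption.
fmt <- "%d/%m/%Y %H:%M"
t1 <- as.POSIXct(c("30/03/2011 02:32", "30/03/2011 02:42", "30/03/2011 02:52",
                   "30/03/2011 03:02", "30/03/2011 03:12", "30/03/2011 03:22"),
                 format = fmt, tz = "UTC")
t2 <- as.POSIXct(c("30/03/2011 01:59", "30/03/2011 02:58", "30/03/2011 03:55",
                   "30/03/2011 04:53"), format = fmt, tz = "UTC")  # sorted

tol <- 5 * 60  # assumed tolerance, in seconds

# findInterval() returns the index of the last t2 value <= (t1 - tol);
# adding 1 moves to the dts2 block that ends next.
idx <- findInterval(as.numeric(t1) - tol, as.numeric(t2)) + 1
idx[idx > length(t2)] <- NA  # no later block available

linked <- data.frame(datetime_dts1 = t1, datetime_dts2 = t2[idx])
```

With the full data frames, `cbind(dts1, dts2[idx, ])` would attach the matching dts2 columns to each dts1 row.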
