Match data to nearest time value by id
Problem description
I have generated a series of hourly time stamps with:
intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = 'America/Los_Angeles'), as.POSIXct("2018-01-20 03:00:00", tz = 'America/Los_Angeles'), by="hour")
> intervals
[1] "2018-01-20 00:00:00 PST" "2018-01-20 01:00:00 PST" "2018-01-20 02:00:00 PST"
[4] "2018-01-20 03:00:00 PST"
Given a dataset with messy and unevenly spaced timestamps, how would one match time values from that dataset to the closest hourly timestamp by id, and remove the other timestamps in between? For example:
> test
time id amount
312 2018-01-20 00:02:14 PST 1 54.9508346
8652 2018-01-20 00:54:41 PST 2 30.5557992
13809 2018-01-20 01:19:27 PST 3 90.5459248
586 2018-01-20 00:03:35 PST 1 79.7635973
9077 2018-01-20 00:56:37 PST 2 75.5356406
21546 2018-01-20 02:25:05 PST 3 36.6017705
7275 2018-01-20 00:47:45 PST 1 12.7618139
12768 2018-01-20 01:15:30 PST 2 72.4465838
1172 2018-01-20 00:08:01 PST 3 81.0468155
24106 2018-01-20 03:04:10 PST 1 0.8615881
14464 2018-01-20 01:25:04 PST 2 49.8718743
15344 2018-01-20 01:29:30 PST 3 85.0054113
14255 2018-01-20 01:23:22 PST 1 34.5093891
21565 2018-01-20 02:25:40 PST 2 69.0175725
15602 2018-01-20 01:31:32 PST 3 61.8602426
would produce:
> output
interval id amount
1 2018-01-20 01:00:00 1 12.7618139
2 2018-01-20 1 54.9508346
3 2018-01-20 03:00:00 1 0.8615881
4 2018-01-20 01:00:00 2 75.5356400
5 2018-01-20 02:00:00 2 69.0175700
6 2018-01-20 3 81.0468200
7 2018-01-20 01:00:00 3 90.5459200
8 2018-01-20 02:00:00 3 36.6017700
I am aware of data.table's rolling join
setDT(reference)[data, refvalue, roll = "nearest", on = "datetime"]
with roll = "nearest", but how would one find the nearest match in intervals for every id in test and retain the amount attribute?
Any suggestions would be appreciated! Here is the sample data:
dput(test)
structure(list(time = c("2018-01-20 00:02:14 PST", "2018-01-20 00:54:41 PST",
"2018-01-20 01:19:27 PST", "2018-01-20 00:03:35 PST", "2018-01-20 00:56:37 PST",
"2018-01-20 02:25:05 PST", "2018-01-20 00:47:45 PST", "2018-01-20 01:15:30 PST",
"2018-01-20 00:08:01 PST", "2018-01-20 03:04:10 PST", "2018-01-20 01:25:04 PST",
"2018-01-20 01:29:30 PST", "2018-01-20 01:23:22 PST", "2018-01-20 02:25:40 PST",
"2018-01-20 01:31:32 PST"), id = c(1, 2, 3, 1, 2, 3, 1, 2, 3,
1, 2, 3, 1, 2, 3), amount = c(54.9508346011862, 30.5557992309332,
90.5459248460829, 79.763597343117, 75.5356406327337, 36.6017704829574,
12.7618139144033, 72.4465838400647, 81.0468154959381, 0.861588073894382,
49.8718742514029, 85.0054113194346, 34.5093891490251, 69.0175724914297,
61.8602426256984)), .Names = c("time", "id", "amount"), row.names = c(312L,
8652L, 13809L, 586L, 9077L, 21546L, 7275L, 12768L, 1172L, 24106L,
14464L, 15344L, 14255L, 21565L, 15602L), class = "data.frame")
Recommended answer
Another option is to do the join inside j with data.table:
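library(data.table)
# convert 'test' to a 'data.table' first with 'setDT'
# and convert the 'time' column to a datetime format;
# the time zone is set explicitly so that it matches 'intervals'
setDT(test)[, time := as.POSIXct(time, tz = 'America/Los_Angeles')][]
# perform the rolling join per id
test[, .SD[.(time = intervals), on = .(time), roll = 'nearest'], by = id]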
which gives:
id time amount
1: 1 2018-01-20 00:00:00 54.9508346
2: 1 2018-01-20 01:00:00 12.7618139
3: 1 2018-01-20 02:00:00 34.5093891
4: 1 2018-01-20 03:00:00 0.8615881
5: 2 2018-01-20 00:00:00 30.5557992
6: 2 2018-01-20 01:00:00 75.5356406
7: 2 2018-01-20 02:00:00 69.0175725
8: 2 2018-01-20 03:00:00 69.0175725
9: 3 2018-01-20 00:00:00 81.0468155
10: 3 2018-01-20 01:00:00 90.5459248
11: 3 2018-01-20 02:00:00 36.6017705
12: 3 2018-01-20 03:00:00 36.6017705
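The per-id nearest match can be cross-checked without data.table. The following is only a minimal base R sketch on made-up toy data: the data frame toy, its amounts, and the helper nearest_per_id are hypothetical names introduced for illustration and are not part of the original answer. For every hourly interval it picks, within each id, the row whose time is closest.

```r
tz <- "America/Los_Angeles"
intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = tz),
                 as.POSIXct("2018-01-20 02:00:00", tz = tz), by = "hour")

# hypothetical toy data, not the dataset from the question
toy <- data.frame(
  time = as.POSIXct(c("2018-01-20 00:02:14", "2018-01-20 00:47:45",
                      "2018-01-20 00:54:41", "2018-01-20 02:25:40"), tz = tz),
  id = c(1, 1, 2, 2),
  amount = c(10, 20, 30, 40)
)

# for the rows of a single id, find the row nearest to each interval
nearest_per_id <- function(d) {
  idx <- vapply(seq_along(intervals), function(k) {
    which.min(abs(as.numeric(d$time) - as.numeric(intervals[k])))
  }, integer(1))
  data.frame(id = d$id[idx], time = intervals, amount = d$amount[idx])
}

# apply the helper to each id and stack the results
res <- do.call(rbind, lapply(split(toy, toy$id), nearest_per_id))
rownames(res) <- NULL
res
```

Every interval always receives a match, so a single observation can be reused for several intervals when an id has no closer data point.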
In the join above, some amount values are assigned to more than one time per id. If you don't want that and only want to keep each value at the time it is closest to, you could refine the approach as follows:
test[, r := rowid(id)
][, .SD[.(time = intervals)
, on = .(time)
, roll = 'nearest'
, .(time, amount, r, time_diff = abs(x.time - i.time))
][, .SD[which.min(time_diff)], by = r]
, by = id][, c('r','time_diff') := NULL][]
which gives:
id time amount
1: 1 2018-01-20 00:00:00 54.9508346
2: 1 2018-01-20 01:00:00 12.7618139
3: 1 2018-01-20 02:00:00 34.5093891
4: 1 2018-01-20 03:00:00 0.8615881
5: 2 2018-01-20 00:00:00 30.5557992
6: 2 2018-01-20 01:00:00 75.5356406
7: 2 2018-01-20 02:00:00 69.0175725
8: 3 2018-01-20 00:00:00 81.0468155
9: 3 2018-01-20 01:00:00 90.5459248
10: 3 2018-01-20 02:00:00 36.6017705
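The deduplication step can be illustrated in isolation. The sketch below is a base R approximation of the same idea, not the answer's data.table code: it uses the five observations of id 2 from the question, and the variable names (obs_time, d, nearest, matched, keep) are assumptions chosen for this example. Each interval is matched to its nearest observation, and an observation that was matched more than once is kept only at the interval with the smallest time difference.

```r
tz <- "America/Los_Angeles"
intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = tz),
                 as.POSIXct("2018-01-20 03:00:00", tz = tz), by = "hour")

# observations of id 2, copied from the question's sample data
obs_time <- as.POSIXct(c("2018-01-20 00:54:41", "2018-01-20 00:56:37",
                         "2018-01-20 01:15:30", "2018-01-20 01:25:04",
                         "2018-01-20 02:25:40"), tz = tz)
amount <- c(30.5557992, 75.5356406, 72.4465838, 49.8718743, 69.0175725)

# absolute distance in seconds: intervals in rows, observations in columns
d <- abs(outer(as.numeric(intervals), as.numeric(obs_time), "-"))

# index of the nearest observation for each interval, and its distance
nearest <- apply(d, 1, which.min)
time_diff <- d[cbind(seq_along(intervals), nearest)]
matched <- data.frame(time = intervals, amount = amount[nearest],
                      r = nearest, time_diff = time_diff)

# for each observation r, keep only the interval it is closest to
keep <- unlist(lapply(split(seq_len(nrow(matched)), matched$r),
                      function(i) i[which.min(matched$time_diff[i])]))
result <- matched[sort(keep), c("time", "amount")]
result
```

The observation at 02:25:40 is the nearest one for both 02:00:00 and 03:00:00, and it survives only at 02:00:00, which agrees with the three rows shown for id 2 above.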