连接键值不完全相等的 R data.tables - 将行与最接近的时间组合 [英] Join R data.tables where key values are not exactly equal--combine rows with closest times
问题描述
是否有一种巧妙的方法可以在 R 中加入时间键值接近但不完全相同的数据表?例如,假设我有一个针对不同时间段给出的结果数据表:
Is there a slick way to join data tables in R where key values of time are close, but not exactly the same? For example, suppose I have a data table of results that are given for different time periods:
DT1 = data.table(x=rep(c("a","b","c"),each=3), time=c(10,30,60), v=1:9)
在这里,我们有一些值 (v),用于在不同时间 (time) 获取的不同类别 (x).现在假设我有来自另一个来源的数据,这些数据为不同类别提供了一些时间值:
Here, we have some values (v) for different categories (x) taken at different times (time). Now suppose that I have data from another source that provides some time values for the different categories:
DT2=data.table(x=rep(c("a","b","c"),each=1),time=c(10,10,60))
我可能有兴趣尝试将 DT2 中的时间与 DT1 尽可能接近地匹配,以预测我的 DT2 类别的值 v.我想做类似的事情
I might be interested in trying to match the times in DT2 as closely as I can to DT1 to predict a value, v, for my DT2 categories. I would like to do something like
setkeyv(DT2,c("x","time"))
merge(DT1,DT2,by=c("time","v")
返回:
time x v
1: 10 a 1
2: 10 b 4
3: 60 c 9
但如果我的时间没有相同的精度怎么办?例如:
But what if my times didn't have the same precision? For example:
DT2=data.table(x=rep(c("a","b","c"),each=1),time=c(17,54,3))
有没有办法执行类似的合并,但 DT2 的选择时间接近 DT1 的时间?也就是说,17 接近 30,54 接近 60,3 接近 10?
Is there a way to perform a similar merge but pick times of DT2 that are close to those of DT1? That is 17 would be close to 30, 54 close to 60, and 3 close to 10?
如果这个简单的例子不清楚,我将简要解释一下我遇到的更大的问题.我有一个包含列的数据表:类别、时间、输出 1、输出 2...有数百个类别与相关时间.我可能想在特定时间为所有类别提取输出 1.由于时间的采样没有明显的逻辑,有时时间会四舍五入到最接近的偶数秒;在其他情况下,时间会四舍五入到最接近的分钟甚至 10 分钟.
If this simple example isn't clear, I'll briefly explain the larger problem that I am having. I have a data table with columns: category, time, output1, output2... There are hundreds of categories with associated times. I might want to pull output 1 for all categories at a specific time. Since the times were sampled with no apparent logic, sometimes the time is rounded to the nearest even second; in other cases the times are rounded to the nearest minute or even 10 minutes.
我可以编写一个脚本以更常见的格式重写时间,但我很好奇是否有我没有见过的巧妙的 data.table 解决方案.我探索了滚动合并,但没有成功.
I could write a script to rewrite times in more common format, but I am curious if there is a slick data.table solution that I haven't seen. I have explored the rolling merge with no success.
推荐答案
另一个选项可能是 roll='nearest'
(CRAN v1.8.8 中的新功能).
Another option may be roll='nearest'
(new in v1.8.8 on CRAN).
> setkey(DT1,x,time)
> DT1
x time v
1: a 10 1
2: a 30 2
3: a 60 3
4: b 10 4
5: b 30 5
6: b 60 6
7: c 10 7
8: c 30 8
9: c 60 9
> DT2
x time
1: a 17
2: b 54
3: c 3
> DT1[DT2,roll="nearest"]
x time v
1: a 17 1
2: b 54 6
3: c 3 7
请注意,17 似乎比 30 更接近 10,因此结果在第一行.
Note that 17 appears to be closer to 10 than 30, hence the result in the first row.
如果您需要滚动到下一个观察值(下一个观察值向后进行):
If you need to roll to the next observation (next observation carried backwards) :
> DT1[DT2,roll=-Inf]
x time v
1: a 17 2
2: b 54 6
3: c 3 7
这篇关于连接键值不完全相等的 R data.tables - 将行与最接近的时间组合的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!