R数据表滚动连接“多”不能按预期工作 [英] R data.table rolling join "mult" not working as expected
问题描述
我有两个data.frames,每个都有一个时间序列。我的目标是使用df2的时间序列标记在df1最接近的时间戳。 df2中的每个时间戳应该只在df1中标记一个时间戳!
I have two data.frames each with a time series. My goal ist to use the timeseries of df2 to mark the closest timestamp in df1. Each timestamp in df2 should only mark ONE timestamp in df1!
dates1 <- as.POSIXct(c("2015-10-26 12:00:00","2015-10-26 13:00:00","2015-10-26 14:00:00"))
values1 <- c("a","b","c")
dates2 <- as.POSIXct(c("2015-10-26 12:05:00","2015-10-26 13:55:00"))
values2 <- c("A","C")
df1 <- data.frame(dates1, values1)
df2 <- data.frame(dates2, values2)
所需结果:
dates2 values2 values1
1: 2015-10-26 12:00:00 A a
2: 2015-10-26 13:00:00 NA b
3: 2015-10-26 14:00:00 C c
为了实现这一点,我将data.frames转换为data.tables和使用滚动的最近像这样:
In order to achieve this, I'm converting the data.frames to data.tables and using the rolling joing "nearest" like so:
dt1 <- data.table(df1)
dt2 <- data.table(df2)
setkey(dt1,"dates1")
setkey(dt2,"dates2")
dt3 <- dt2[dt1,roll = "nearest"]
dates2 values2 values1
1: 2015-10-26 12:00:00 A a
2: 2015-10-26 13:00:00 A b
3: 2015-10-26 14:00:00 C c
values2A使用两次,一次用于12:00时间戳,一次用于13:00。我想要每个值2只使用一次,并咨询data.table手册我希望使用选项 mult =第一解决这个问题。
The values2 "A" is used twice, once for the 12:00 timestamp and once for 13:00. I want each value2 to be used only once, and consulting the data.table manual I would expect to solve this problem with using the option mult = "first".
dt3 <- dt2[dt1,roll = "nearest", mult = "first"]
这将产生相同的输出,A被使用两次。我的错误在哪里?
This results in the same output, "A" is used twice. Where is my mistake?
推荐答案
运行 When running dt2 [dt1,roll =nearest]
你根据 dt1 中每行最近的连接返回
code> df2
/ code>使用键。dt2[dt1, roll = "nearest"]
you are basically saying "return the rows from df2
according to the nearest join to each row in dt1
using the key. So
dates2
一中一行的最近日期1
在一行中的 dt1
dates2
dt2
最接近 dates1
在两个中的 dt1
日期2
c1 c1> c1 c c c $
dates2
in row one indt2
is the nearest todates1
in row one indt1
dates2
in row one indt2
is the nearest todates1
in row two indt1
dates2
in row two indt2
is the nearest todates1
in row three indt1
因此,
dt2[dt1, roll = "nearest"]
# dates2 values2 values1
# 1: 2015-10-26 12:00:00 A a
# 2: 2015-10-26 13:00:00 A b
# 3: 2015-10-26 14:00:00 C c
这是全部 dt1
与加入 values2
从 dt2
。
即根据每个最近的连接提取 values2
从 dt2
在 dt2
中使用键更新 dt1
中的行
Instead, we want to join the other way around, namely "extract values2
from dt2
according to the nearest join by each row in dt2
using the key and update the matched rows in dt1
", namely
dt1[dt2, roll = "nearest", values2 := i.values2]
dt1
# dates1 values1 values2
# 1: 2015-10-26 12:00:00 a A
# 2: 2015-10-26 13:00:00 b NA
# 3: 2015-10-26 14:00:00 c C
一些其他注意事项
Some additional notes
- 您不需要先包装
data.frame
然后到data.table
,你可以只做dt1< - data.table(dates1,values1)
参数data.table
,即dt1
等。 - 或者你可以跳过设置键,使用
改为(V 1.9.6+),即
dt1 [dt2 ,roll =nearest,values2:= i.values2,on = c(dates1 =dates2)]
- 最后,例如,代替
< -
和data.table(df)
使用=
和setDT(df)
,请参阅此处了解详情
- You don't need to wrap first to
data.frame
and then todata.table
, you can just dodt1 <- data.table(dates1, values1)
and etc. - While you at it, you can already set the key on the fly using
key
parameterdata.table
, namelydt1 <- data.table(dates1, values1, key = "dates1")
and etc. - Or you can skip setting keys all together and use
on
instead (V 1.9.6+), namelydt1[dt2, roll = "nearest", values2 := i.values2, on = c(dates1 = "dates2")]
- Finally, please refrain from making unnecessary copies, e.g., instead of
<-
anddata.table(df)
use:=
andsetDT(df)
, see here for more information
这篇关于R数据表滚动连接“多”不能按预期工作的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!