R: conditional join based on timestamp interval
Problem description
I have two data.frames:
# limits
ID Start_1 End_1 Start_2 End_2
1 A 2013-04-23 2013-06-09 2013-04-26 2017-02-06
2 B 2013-05-12 2013-08-08 2013-04-26 2017-02-06
3 C 2013-04-24 2013-04-26 2017-02-05 2017-02-08
and
# df (first six rows shown)
Timestamp ID
1 2013-04-24 14:01:21 A
2 2013-04-24 14:01:46 B
3 2013-04-24 14:01:50 C
4 2013-04-25 00:02:19 A
5 2013-04-25 02:02:48 B
6 2013-04-25 04:02:04 C
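I would like to fill a column Pop in the data.frame df based on the timestamp of each observation (df$Timestamp): if df$Timestamp falls between the first pair of time limits (stored in the data.frame limits as limits$Start_1 and limits$End_1), Pop is filled with 'Yes', otherwise with 'No'.
If df$Timestamp falls between the second pair of time limits (limits$Start_2 and limits$End_2), Pop is filled with 'Maybe', overwriting any 'Yes' or 'No'.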
Here is the setup:
# main data.frame
df<-structure(list(Timestamp = structure(c(1366826481, 1366826506,
1366826510, 1366862539, 1366869768, 1366876924, 1366948927, 1366948963,
1367013725, 1367107304, 1367107308, 1367107316, 1486342833, 1486350011,
1486350026, 1486429233, 1486436435, 1486436459, 1486515633, 1486522816,
1486522834, 1486530052, 1486537217, 1486537251),
class = c("POSIXct","POSIXt"), tzone = ""),
ID = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L,1L, 2L, 3L),
.Label = c("A", "B", "C"), class = "factor")),
.Names = c("Timestamp", "ID"), row.names = c(NA, -24L), class = "data.frame")
# data.frame with time limits
limits<- structure(list(ID = structure(1:3, .Label = c("A", "B", "C"), class = "factor"),
Start_1 = structure(c(1366689600, 1368331200, 1366776000), class = c("POSIXct","POSIXt"), tzone = ""),
End_1 = structure(c(1370750400, 1375934400,1366948800), class = c("POSIXct", "POSIXt"), tzone = ""),
Start_2 = structure(c(1366948800, 1366948800, 1486270800), class = c("POSIXct", "POSIXt"), tzone = ""),
End_2 = structure(c(1486357200, 1486357200, 1486530000), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("ID","Start_1", "End_1", "Start_2", "End_2"),
row.names = c(NA,-3L), class = "data.frame")
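I have two methods that seem to work, but they are cumbersome, and for a real-life dataset (thousands of rows for the same ID, hundreds of IDs and $End periods) it is hard to "trust" that they work.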
###### Method 1 ######
library(dplyr)  # provides left_join()
df1<-df
df1<-left_join(df1, limits, by="ID")
df1$Pop<-ifelse(df1$Timestamp>as.POSIXct(df1$Start_1) &
df1$Timestamp<as.POSIXct(df1$End_1), "Yes", "No")
df1$Pop<-ifelse(df1$Timestamp>as.POSIXct(df1$Start_2) &
df1$Timestamp<as.POSIXct(df1$End_2), "Maybe", df1$Pop)
df1$Pop<-as.factor(df1$Pop)
df1<-df1[,-c(3,6)]
###### Method 2 ######
df2<-df
df2<-df2[with(df2, order(ID, Timestamp)), ]
ids<-as.factor(levels(droplevels(df2$ID)))
tmp<-NULL
for(i in 1:length(ids)) {
tmp[[i]]<-ifelse(df2$Timestamp[which(df2$ID==ids[i])]> as.POSIXct(limits$Start_1[i]) &
df2$Timestamp[which(df2$ID==ids[i])]< as.POSIXct(limits$End_1[i]), "Yes", "No") }
tmp<-data.frame(Pop = unlist(tmp)) # tmp is a list - this turns it into a data-frame
df2<-cbind(df2,tmp)
# add 'Maybe'
tmp2<-NULL
for(i in 1:length(ids)) {
tmp2[[i]]<-df2$Timestamp[which(df2$ID==ids[i])]> as.POSIXct(limits$Start_2[i]) &
df2$Timestamp[which(df2$ID==ids[i])]< as.POSIXct(limits$End_2[i]) }
tmp2<-data.frame(Pop = unlist(tmp2))
df2$Pop<-as.character(df2$Pop)
df2$Pop[which(tmp2$Pop==TRUE)]<-'Maybe'
df2$Pop<-as.factor(df2$Pop)
df2<-df2[with(df2, order(Timestamp)), ]
Is there a better way (using a function or a package) to do this kind of join?
Edit:
In the ifelse() calls of Method 1, I had originally used limits$Start_1, limits$End_1, etc. instead of df1$Start_1, df1$End_1, etc.
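On the question of trusting the result, one possible approach is to compare a vectorised computation against a slow row-by-row reference on a small sample. The sketch below uses only base R; the toy data and the names limits_toy, df_toy, pop_vec and pop_ref are invented for illustration, and it assumes exactly one row of limits per ID.

```r
# Toy data (made up for illustration; UTC keeps the result reproducible)
utc <- function(x) as.POSIXct(x, tz = "UTC")
limits_toy <- data.frame(
  ID      = c("A", "B"),
  Start_1 = utc(c("2013-04-23", "2013-05-12")),
  End_1   = utc(c("2013-06-09", "2013-08-08")),
  Start_2 = utc(c("2013-04-26", "2013-04-26")),
  End_2   = utc(c("2017-02-06", "2017-02-06")),
  stringsAsFactors = FALSE
)
df_toy <- data.frame(
  Timestamp = utc(c("2013-04-24 14:01:21", "2013-04-24 14:01:46",
                    "2013-04-27 09:00:00", "2017-03-01 00:00:00")),
  ID        = c("A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Vectorised version: look up each row's limits by ID, then compare
lim <- limits_toy[match(df_toy$ID, limits_toy$ID), ]
pop_vec <- ifelse(df_toy$Timestamp > lim$Start_1 & df_toy$Timestamp < lim$End_1, "Yes", "No")
pop_vec <- ifelse(df_toy$Timestamp > lim$Start_2 & df_toy$Timestamp < lim$End_2, "Maybe", pop_vec)

# Slow reference version: apply the rule one row at a time
pop_ref <- character(nrow(df_toy))
for (r in seq_len(nrow(df_toy))) {
  l  <- limits_toy[limits_toy$ID == df_toy$ID[r], ]
  ts <- df_toy$Timestamp[r]
  pop_ref[r] <- if (ts > l$Start_2 && ts < l$End_2) {
    "Maybe"
  } else if (ts > l$Start_1 && ts < l$End_1) {
    "Yes"
  } else {
    "No"
  }
}

# The two versions should agree on every row
print(identical(pop_vec, pop_ref))
```

Running the same comparison on a random sample of the real data would give some evidence that a faster method labels rows as intended.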
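Recommended answer
ygs, here is a data.table solution (using the setup code above, but wrapping the data.frames in "as.data.table()"). It also assumes, based on your solutions, that you do want "Maybe" to overwrite a Yes/No answer.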
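library(data.table)
#Convert the data.frames from the setup code to data.tables
df <- as.data.table(df)
limits <- as.data.table(limits)
#Set keys on ID's for join
setkey(df, ID)
setkey(limits, ID)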
#Join the data.tables on ID
DT <- df[limits]
#Create "pop" column and chain to desired columns from user output
DT <- DT[, ':=' (Pop = ifelse(Timestamp > Start_2 & Timestamp < End_2, "Maybe",
ifelse(Timestamp > Start_1 & Timestamp < End_1, "Yes","No")))][, c(1,2,5,6,7)]
Update: a more elegant version:
#Create "pop" column
DT <- df[limits, ':=' (Pop = ifelse(Timestamp > Start_2 & Timestamp < End_2, "Maybe",
ifelse(Timestamp > Start_1 & Timestamp < End_1, "Yes","No"))), by = .EACHI, on = "ID"]
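As an alternative to the answer's ifelse() approach, data.table's non-equi update joins can express the interval conditions directly in on=. The sketch below is not part of the original answer; it uses made-up toy data (limits_dt, df_dt) and assumes a data.table version that supports non-equi joins (1.9.8 or later).

```r
library(data.table)

# Toy data (made up for illustration; UTC keeps the result reproducible)
utc <- function(x) as.POSIXct(x, tz = "UTC")
limits_dt <- data.table(
  ID      = c("A", "B"),
  Start_1 = utc(c("2013-04-23", "2013-05-12")),
  End_1   = utc(c("2013-06-09", "2013-08-08")),
  Start_2 = utc(c("2013-04-26", "2013-04-26")),
  End_2   = utc(c("2017-02-06", "2017-02-06"))
)
df_dt <- data.table(
  Timestamp = utc(c("2013-04-24 14:01:21", "2013-04-24 14:01:46",
                    "2013-04-27 09:00:00", "2017-03-01 00:00:00")),
  ID        = c("A", "B", "A", "B")
)

# Start with the default label, then overwrite by reference
df_dt[, Pop := "No"]
# Rows whose Timestamp lies strictly inside the first interval
df_dt[limits_dt, on = .(ID, Timestamp > Start_1, Timestamp < End_1), Pop := "Yes"]
# Rows inside the second interval take precedence
df_dt[limits_dt, on = .(ID, Timestamp > Start_2, Timestamp < End_2), Pop := "Maybe"]

print(df_dt)
```

Because each step is a join rather than a per-row comparison of joined columns, this form should also cope with several interval rows per ID in the limits table, with the last update winning.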