Identify breaks in time series and assign unique factor for each break in R


Problem description

I've got a large df of vessel track time series data for hundreds of vessels. The time series spans multiple years, so each vessel has multiple tracks. Each 'track' is hourly data, and there are large gaps (of days or more) in the time series that I'm hoping to use to identify each individual track per vessel.

My plan is to use a loop to first select a vessel and its whole time series, then identify the unique tracks for that vessel, then split the individual tracks into a list, do some math on each, unsplit, and append the result to a new data frame of all vessels. I can't work out how to give a unique factor to each identified track for split(). Some simplified data:

vessel<-c(rep("A",11))
time <- as.POSIXct(c("2017-01-01 00:02:25 GMT", "2017-01-01 01:31:26 GMT", "2017-01-01 02:37:42 GMT",
                     "2017-01-01 03:14:34 GMT", "2017-01-01 04:09:45 GMT", "2017-02-01 05:51:53 GMT",
                     "2017-03-01 06:22:24 GMT", "2017-03-01 07:34:44 GMT","2017-03-01 08:01:15 GMT",
                     "2017-03-01 09:16:44 GMT", "2017-03-01 10:48:12 GMT")) 

df<-data.frame(vessel,time)

You'll see that I've added a single time (row 6) that isn't part of a track. The data is riddled with these single pings that are not part of any track, and I'd also like to know how to deal with those occurrences and delete them somehow. The code I've picked up from other posts looks like this so far:

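# Hours between consecutive rows. The units are set explicitly because
# subtracting POSIXct values picks the units from the data (minutes here).
df$gap <- c(0, as.numeric(difftime(df$time[-1], df$time[-nrow(df)], units = "hours")))
gap_threshold <- 10 # a gap of more than 10 hours is treated as a different track
df$over_thresh <- df$gap < gap_threshold # despite the name, TRUE means the gap is below the threshold
df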

This identifies where the breaks are, but how do I then assign a unique factor to each track that I can use in split(df, df$split_factor)? Ideally the final df should look something like the following, but I don't know how to create the column 'split_factor'.

vessel                time         gap       over_thresh  split_factor
1       A 2017-01-01 00:02:25   0.0000000        TRUE      split_1
2       A 2017-01-01 01:31:26   1.4836111        TRUE      split_1
3       A 2017-01-01 02:37:42   1.1044444        TRUE      split_1
4       A 2017-01-01 03:14:34   0.6144444        TRUE      split_1
5       A 2017-01-01 04:09:45   0.9197222        TRUE      split_1
6       A 2017-02-01 05:51:53 745.7022222       FALSE       delete
7       A 2017-03-01 06:22:24 672.5086111       FALSE      split_2
8       A 2017-03-01 07:34:44   1.2055556        TRUE      split_2
9       A 2017-03-01 08:01:15   0.4419444        TRUE      split_2
10      A 2017-03-01 09:16:44   1.2580556        TRUE      split_2
11      A 2017-03-01 10:48:12   1.5244444        TRUE      split_2

The second track starts at row 7, but it's been identified as FALSE because of the time difference from the previous row. However, it needs to be labeled as part of the next track.

Also, this is all being done with spatial data frames. I assume the same approach works there, but I could be wrong. If not, I can extract the data and re-create the spatial data frame, no problem. Thanks.

Recommended answer

Here is one option with data.table. Create a grouping index ('grp') with rleid on 'over_thresh' OR its lead (the next value), grouped by 'vessel', then create 'split_factor' as a column filled with the string 'delete'. Get the row index (.I) of the groups, by 'vessel' and 'grp', in which any element of 'over_thresh' is TRUE, use that in i, and within each of those groups paste the prefix 'split_' onto the group index (.GRP) to overwrite 'split_factor' for those rows.

library(data.table)
setDT(df)[, grp := rleid(over_thresh|shift(over_thresh, type = 'lead')), vessel]
df[, split_factor := 'delete']
i1 <- df[, .I[any(over_thresh)], .(vessel, grp)]$V1
df[i1, split_factor := paste0('split_', .GRP), .(vessel, grp)][, grp := NULL][]
#     vessel                time         gap over_thresh split_factor
# 1:      A 2017-01-01 00:02:25   0.0000000        TRUE      split_1
# 2:      A 2017-01-01 01:31:26   1.4836111        TRUE      split_1
# 3:      A 2017-01-01 02:37:42   1.1044444        TRUE      split_1
# 4:      A 2017-01-01 03:14:34   0.6144444        TRUE      split_1
# 5:      A 2017-01-01 04:09:45   0.9197222        TRUE      split_1
# 6:      A 2017-02-01 05:51:53 745.7022222       FALSE       delete
# 7:      A 2017-03-01 06:22:24 672.5086111       FALSE      split_2
# 8:      A 2017-03-01 07:34:44   1.2055556        TRUE      split_2
# 9:      A 2017-03-01 08:01:15   0.4419444        TRUE      split_2
#10:      A 2017-03-01 09:16:44   1.2580556        TRUE      split_2
#11:      A 2017-03-01 10:48:12   1.5244444        TRUE      split_2
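
To see why the OR with the shifted column pulls row 7 into the second track, the intermediate values can be reproduced in base R. This is only an illustrative sketch: the vector is the 'over_thresh' column from the example, and the names next_ok, in_track and run_id are made up here rather than taken from the answer.

```r
# over_thresh from the example: TRUE when the gap to the previous row is under the threshold
over_thresh <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Equivalent of shift(over_thresh, type = "lead"): the next row's value, NA for the last row
next_ok <- c(over_thresh[-1], NA)

# A row counts as part of a track if it is close to the previous row OR to the next one
in_track <- over_thresh | next_ok

# Run ids comparable to rleid(): one id per run of identical consecutive values
r <- rle(in_track)
run_id <- rep(seq_along(r$lengths), r$lengths)

print(in_track)
print(run_id)
```

Row 6 is far from both neighbours, so it forms a run of its own, while row 7 is far from row 6 but close to row 8 and therefore joins the run that follows it.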


Or, using dplyr and rle: after grouping by 'vessel', apply rle to 'over_thresh' OR (|) the lead (i.e. the next value) of 'over_thresh', which returns a list of lengths and values. Then modify the 'values' (which are logical) by replacing the TRUE ones with 'split_' followed by their sequence number, rebuild the full-length vector with inverse.rle, and change the remaining FALSE entries to 'delete'.

library(dplyr)
library(stringr)
df %>% 
     group_by(vessel) %>% 
     mutate(split_factor = inverse.rle(within.list(rle(over_thresh|
            lead(over_thresh)),
          values[values] <- str_c('split_', seq_along(values[values])))), 
          split_factor = replace(split_factor, 
             !as.logical(split_factor), 'delete'))
# A tibble: 11 x 5
# Groups:   vessel [1]
#   vessel time                    gap over_thresh split_factor
#   <chr>  <dttm>                <dbl> <lgl>       <chr>       
# 1 A      2017-01-01 00:02:25   0     TRUE        split_1     
# 2 A      2017-01-01 01:31:26   1.48  TRUE        split_1     
# 3 A      2017-01-01 02:37:42   1.10  TRUE        split_1     
# 4 A      2017-01-01 03:14:34   0.614 TRUE        split_1     
# 5 A      2017-01-01 04:09:45   0.920 TRUE        split_1     
# 6 A      2017-02-01 05:51:53 746.    FALSE       delete      
# 7 A      2017-03-01 06:22:24 673.    FALSE       split_2     
# 8 A      2017-03-01 07:34:44   1.21  TRUE        split_2     
# 9 A      2017-03-01 08:01:15   0.442 TRUE        split_2     
#10 A      2017-03-01 09:16:44   1.26  TRUE        split_2     
#11 A      2017-03-01 10:48:12   1.52  TRUE        split_2 
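
For comparison, a similar labelling can be sketched in base R with cumsum and ave, followed by the split() step from the question. This is not part of the answer above, only a rough alternative under some assumptions: the timestamps are read with tz = "GMT", the columns track and n_in_track are hypothetical helpers, and the labels include the vessel name so they stay unique across vessels. Because single pings use up a track number before being dropped, the labels are unique but not consecutive.

```r
# Example data from the question (tz = "GMT" is an assumption made here)
vessel <- rep("A", 11)
time <- as.POSIXct(c("2017-01-01 00:02:25", "2017-01-01 01:31:26", "2017-01-01 02:37:42",
                     "2017-01-01 03:14:34", "2017-01-01 04:09:45", "2017-02-01 05:51:53",
                     "2017-03-01 06:22:24", "2017-03-01 07:34:44", "2017-03-01 08:01:15",
                     "2017-03-01 09:16:44", "2017-03-01 10:48:12"), tz = "GMT")
df <- data.frame(vessel, time)
gap_threshold <- 10  # hours

# Sort so that gaps are measured between consecutive pings of the same vessel
df <- df[order(df$vessel, df$time), ]

# Hours since the previous ping of the same vessel (0 for each vessel's first ping)
df$gap <- ave(as.numeric(df$time), df$vessel,
              FUN = function(x) c(0, diff(x)) / 3600)

# A new track starts wherever the gap exceeds the threshold; cumsum numbers the tracks
df$track <- ave(as.numeric(df$gap > gap_threshold), df$vessel, FUN = cumsum) + 1

# Tracks made of a single ping are marked for deletion
df$n_in_track <- ave(df$gap, df$vessel, df$track, FUN = length)
df$split_factor <- ifelse(df$n_in_track > 1,
                          paste0(df$vessel, "_split_", df$track), "delete")

# Drop the isolated pings, then split the remaining rows into one data frame per track
kept <- df[df$split_factor != "delete", ]
tracks <- split(kept, kept$split_factor)
print(names(tracks))
```

After the per-track calculations, do.call(rbind, tracks) would stack the pieces back into a single data frame.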
